LLMpedia: The first transparent, open encyclopedia generated by LLMs

Scientific Data

Generated by GPT-5-mini
Note: This article was automatically generated by a large language model (LLM) from purely parametric knowledge (no retrieval). It may contain inaccuracies or hallucinations. This encyclopedia is part of a research project currently under review.
Article Genealogy
Expansion Funnel Raw 76 → Dedup 0 → NER 0 → Enqueued 0
Scientific Data
Name: Scientific Data
Field: National Science Foundation; Royal Society
Type: Empirical information
Introduced: Royal Society (modern scientific method)
Discipline: Peer review; Open science

Scientific data are empirical observations and measurements generated by systematic inquiry across experimental, observational, and computational methods. They underpin verification of hypotheses, models, and theories within institutions such as the National Aeronautics and Space Administration, CERN, and the National Institutes of Health, and they are central to initiatives promoted by organizations like the European Commission and the Wellcome Trust. Data stewardship informs scientific publishing practices in venues such as the Nature family and the Science family.

Definition and Characteristics

Scientific data are discrete records—numeric, textual, image, or signal—that represent properties of phenomena collected under protocols defined by entities like the World Health Organization or projects such as the Human Genome Project. Typical characteristics include provenance documented through standards from bodies like the International Organization for Standardization and metadata schemas influenced by the Dublin Core and the FAIR Guiding Principles. Quality attributes referenced by the National Academies of Sciences, Engineering, and Medicine include accuracy, precision, resolution, completeness, and representativeness. Reproducibility expectations are shaped by practices advocated by the U.S. Office of Science and Technology Policy and agencies funding research such as the Horizon Europe programme.
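Metadata practice of the kind described above can be sketched as a simple record check. The field names below loosely follow Dublin Core conventions, and all values are invented for illustration; no specific repository's required schema is implied.

```python
import json

# Minimal dataset-description record using Dublin Core-style field
# names; every value here is a hypothetical example.
record = {
    "title": "Surface temperature time series, Station A",
    "creator": "Example Research Group",       # hypothetical creator
    "date": "2024-06-01",
    "format": "text/csv",
    "identifier": "10.1234/example.5678",      # hypothetical DOI
    "license": "CC-BY-4.0",
    "description": "Hourly air temperature readings, 2020-2023.",
}

def missing_fields(rec, required=("title", "creator", "identifier", "license")):
    """Report required fields that are absent or empty."""
    return [f for f in required if not rec.get(f)]

print(missing_fields(record))   # -> []
print(json.dumps(record, indent=2).splitlines()[1])
```

A completeness check like this is the kind of validation repositories run at ingest, although production schemas (e.g. DataCite's) define many more required and recommended properties.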

Types and Formats

Scientific data span modalities: numerical arrays from instruments in facilities like the Large Hadron Collider, sequence reads from projects such as the Human Microbiome Project, imagery from observatories including the Hubble Space Telescope, geospatial rasters and vectors used by the United States Geological Survey, and time-series logs from experiments at institutions like the Jet Propulsion Laboratory. File formats range from plain text and comma-separated values to domain-specific encodings such as FITS for astronomy, FASTQ for genomics, NetCDF and HDF5 for multidimensional arrays, and DICOM for clinical imaging. Metadata and ontologies—developed by consortia like the Gene Ontology Consortium and the Open Geospatial Consortium—provide structure for interoperability and automated discovery.
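Of the formats listed above, FASTQ is simple enough to parse by hand: each sequencing read occupies four lines (an `@`-prefixed identifier, the base sequence, a `+` separator, and per-base quality characters). A minimal reader, with invented sample data:

```python
import io

def parse_fastq(handle):
    """Yield (read_id, sequence, quality) tuples from a FASTQ stream.

    FASTQ stores each read as four lines: '@id', the base sequence,
    a '+' separator, and a quality string of equal length.
    """
    while True:
        header = handle.readline().strip()
        if not header:          # end of stream
            return
        seq = handle.readline().strip()
        handle.readline()       # skip the '+' separator line
        qual = handle.readline().strip()
        yield header[1:], seq, qual

# Two invented reads for demonstration.
sample = "@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nFFFF\n"
reads = list(parse_fastq(io.StringIO(sample)))
print(reads[0])   # ('read1', 'ACGT', 'IIII')
```

Real pipelines use hardened parsers (FASTQ files can contain wrapped lines and malformed records), but the sketch shows why domain formats need format-aware tooling rather than generic text processing.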

Collection and Measurement Methods

Data acquisition methods include controlled laboratory experiments exemplified by protocols used at the Salk Institute, field campaigns organized by the Intergovernmental Panel on Climate Change, remote sensing from satellites deployed by the European Space Agency, and high-throughput sequencing workflows pioneered in projects like the 1000 Genomes Project. Instrument calibration and traceability to standards such as those from the National Institute of Standards and Technology are essential for metrology. Sampling designs derive from statistical frameworks established by scholars associated with institutions like Princeton University and Stanford University. Automation and sensors developed by companies and laboratories collaborating with entities like MIT and Lawrence Berkeley National Laboratory enable real-time telemetry and Internet of Things deployments.
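Calibration of the kind traceable to metrology standards often reduces, in the simplest case, to mapping raw instrument readings onto certified reference values. A two-point linear calibration sketch, with invented readings (this assumes a linear instrument response, which real calibration procedures must verify):

```python
def two_point_calibration(raw_lo, ref_lo, raw_hi, ref_hi):
    """Return a function mapping raw readings to calibrated values,
    assuming the instrument responds linearly between two certified
    reference points."""
    gain = (ref_hi - ref_lo) / (raw_hi - raw_lo)
    offset = ref_lo - gain * raw_lo
    return lambda raw: gain * raw + offset

# Hypothetical thermometer: reads 1.2 at the 0 degC reference point
# and 99.5 at the 100 degC reference point.
calibrate = two_point_calibration(1.2, 0.0, 99.5, 100.0)
print(round(calibrate(50.0), 2))
```

By construction the calibrated function reproduces both reference points exactly; readings between them are interpolated under the linearity assumption.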

Management and Curation

Data management plans aligned with funder mandates from bodies such as the National Science Foundation and repositories like Dryad or Zenodo guide lifecycle practices: ingest, validation, indexing, preservation, and disposal. Curation roles are performed by specialists in libraries and archives, including staff at the Library of Congress and institutional repositories at universities like Harvard University and University of Cambridge. Persistent identifiers such as Digital Object Identifiers and ORCID IDs link datasets to publications in outlets like PLOS and Proceedings of the National Academy of Sciences. Long-term preservation uses strategies informed by standards from the Open Archival Information System model and by communities such as the Research Data Alliance.
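Persistent identifiers such as DOIs follow the syntax `10.<registrant code>/<suffix>`. A loose syntactic check is sometimes useful at ingest; note this verifies form only, not that the identifier actually resolves, and the example DOI suffix below is invented:

```python
import re

# A DOI takes the form "10.<registrant code>/<suffix>". This pattern
# is a loose syntactic check only; it does not confirm that the DOI
# is registered or resolvable.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

def looks_like_doi(s):
    return bool(DOI_PATTERN.match(s))

print(looks_like_doi("10.5061/dryad.example"))   # True (suffix invented)
print(looks_like_doi("not-a-doi"))               # False
```

Repositories typically pair a check like this with a resolution query against the DOI registration agency before accepting a dataset record.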

Analysis and Visualization

Analytical methods employ statistical and computational tools developed in environments like R and Python, with libraries and frameworks originating from research groups at the University of California, Berkeley, and Carnegie Mellon University. Machine learning and simulation models are implemented on infrastructures such as XSEDE and cloud platforms provided by corporations that collaborate with academia and labs like Argonne National Laboratory. Visualization techniques, informed by work from communities such as IEEE Visualization and showcased at conferences like SIGGRAPH, translate complexity into interpretable figures, maps, and interactive dashboards used by projects like Gapminder.
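The first analytical step for most datasets is descriptive summary statistics. A minimal sketch using Python's standard `statistics` module, with an invented measurement series:

```python
import statistics

# Invented measurement series; illustrates the summary statistics
# typically computed before modelling or visualization.
values = [9.8, 10.1, 10.0, 9.9, 10.2, 9.7, 10.3]

summary = {
    "n": len(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "stdev": statistics.stdev(values),   # sample standard deviation
}
print(round(summary["mean"], 3))   # 10.0
```

In practice such summaries feed directly into the plots and dashboards described above, flagging outliers and scale issues before heavier analysis begins.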

Sharing, Access, and Reuse

Open data movements promoted by entities including the European Open Science Cloud and funders such as the Wellcome Trust encourage deposition in domain repositories like GenBank, PANGAEA, and the Protein Data Bank. Licensing frameworks from organizations such as Creative Commons and policy instruments from governments such as the United Kingdom's define terms for reuse. Citation practices linking datasets in journals like Scientific Reports increase credit through mechanisms proposed by groups such as the DataCite consortium.
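Dataset citation practices of the kind DataCite promotes typically assemble a fixed set of fields into a reference string. A sketch of a common citation shape; the exact ordering and punctuation vary by journal style, and every value in the example call is invented:

```python
def format_data_citation(creator, year, title, repository, doi):
    """Assemble a DataCite-style dataset citation of the common form:
    Creator (Year). Title [Data set]. Repository. https://doi.org/DOI
    Exact field order and punctuation vary across journal styles."""
    return (f"{creator} ({year}). {title} [Data set]. "
            f"{repository}. https://doi.org/{doi}")

# All values below are hypothetical, for illustration only.
print(format_data_citation(
    "Doe, J.", 2024, "Coastal salinity measurements",
    "Zenodo", "10.5281/zenodo.0000000"))
```

Machine-actionable citation, with the DOI as a resolvable link, is what lets indexing services count dataset reuse and route credit back to depositors.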

Ethics, Governance, and Reproducibility

Ethical governance addresses concerns in human-subjects data overseen by review boards modeled after the Belmont Report principles and regulations such as the General Data Protection Regulation and legislation like the Health Insurance Portability and Accountability Act. Legal frameworks influence data sharing across borders involving agreements similar to those negotiated by the World Trade Organization and multilateral research programs coordinated by the United Nations Educational, Scientific and Cultural Organization. Reproducibility crises documented in studies affiliated with journals like Nature have prompted replication initiatives supported by organizations like the Center for Open Science and policy responses from agencies including the National Institutes of Health. Robust stewardship, transparent methods, and community standards developed by consortia such as the Committee on Data (CODATA) aim to mitigate risks to validity and trust in the scientific enterprise.

Category:Scientific data