Data Carpentry
Name	Data Carpentry
Founded	2014
Type	Non-profit organization
Purpose	Training researchers in data skills
Parent organization	The Carpentries

Contents

Overview
History and development
Curriculum and workshops
Pedagogy and learning resources
Community and governance
Impact and adoption

Data Carpentry is a community-driven initiative that provides domain-specific training in data management, analysis, and reproducible workflows for researchers and practitioners. It delivers modular lesson materials and hands-on workshops aimed at improving data practices among participants from fields such as ecology, genomics, social science, and archaeology. The organization is part of The Carpentries and collaborates with universities, national laboratories, foundations, and professional societies to scale data literacy.

Overview

Data Carpentry develops lesson modules and workshop programs that integrate practical instruction in file formats, data cleaning, data organization, scripting languages, and version control. Use cases are drawn from examples associated with University of California, Berkeley, University of Washington, Stanford University, University of Cambridge, Massachusetts Institute of Technology, Harvard University, Princeton University, Yale University, Columbia University, University of Oxford, Imperial College London, California Institute of Technology, National Institutes of Health, National Science Foundation, Howard Hughes Medical Institute, Wellcome Trust, European Molecular Biology Laboratory, Los Alamos National Laboratory, Lawrence Berkeley National Laboratory, Argonne National Laboratory, Oak Ridge National Laboratory, CERN, NASA, European Space Agency, Smithsonian Institution, American Association for the Advancement of Science, American Geophysical Union, Ecological Society of America, Society for Industrial and Applied Mathematics, Royal Society, British Ecological Society, PLOS, Nature Research, and Science. Lessons commonly use tools such as R, Python, SQL, Git, and OpenRefine in context-specific workflows built on datasets produced by researchers at institutions such as Smith College, University of British Columbia, University of Toronto, McGill University, University of Melbourne, University of Sydney, Australian National University, University of Auckland, ETH Zurich, University of Geneva, Max Planck Society, Karolinska Institute, University of Tokyo, Peking University, Tsinghua University, Indian Institute of Science, University of São Paulo, Universidade Estadual de Campinas, Federal University of Rio de Janeiro, and University of Cape Town.

History and development

Data Carpentry originated as a response to recognized deficits in hands-on data skills among researchers. It built on the earlier Software Carpentry project, and in 2018 the two merged to form The Carpentries. The initiative grew during the 2010s with seed support from organizations including the National Science Foundation, Mozilla Foundation, Alfred P. Sloan Foundation, Gordon and Betty Moore Foundation, and stakeholders within the National Institutes of Health. Early development featured collaborations among instructors and domain scientists at University of California, Davis, University of Minnesota, Cornell University, University of Colorado Boulder, Pennsylvania State University, Duke University, University of North Carolina at Chapel Hill, University of Michigan, University of Florida, Texas A&M University, University of Illinois Urbana–Champaign, Georgia Institute of Technology, and Arizona State University. Governance and lesson stewardship adapted to community needs through steering committees, lesson maintainers, and governance practices modeled after successful open projects and organizations such as Linux, Apache Software Foundation, Mozilla, GitHub, and Wikipedia.

Curriculum and workshops

The curriculum comprises domain-targeted lessons for disciplines including ecology, genomics, social sciences, geosciences, and humanities. Examples include curriculum created with partners at Monash University, University of Queensland, CSIRO, Wellcome Sanger Institute, European Bioinformatics Institute, Broad Institute, and Cold Spring Harbor Laboratory. Workshops typically run as two-day instructor-led events with pre-workshop setup and post-workshop follow-up. Instructional content aligns with computing environments and data sources from repositories such as Dryad, Figshare, Zenodo, GitHub, Bitbucket, and institutional repositories at major universities, as well as domain databases such as GenBank, European Nucleotide Archive, Protein Data Bank, Global Biodiversity Information Facility, PANGAEA, USGS National Geospatial Program, and NOAA archives.

Pedagogy and learning resources

Pedagogical approaches emphasize learner-centered, evidence-based teaching techniques adapted from the education research literature and from practices at institutions such as Carnegie Mellon University, Stanford Graduate School of Education, Harvard Graduate School of Education, University of Washington Center for Teaching and Learning, and University College London Institute of Education. Methods include live coding, formative assessment, worked examples, and pair programming. Learning resources are openly licensed and maintained on collaborative platforms inspired by systems used by Creative Commons, Open Knowledge Foundation, OpenStax, edX, Coursera, MIT OpenCourseWare, and Khan Academy. Instructor training and certification are offered alongside materials for mentorship and instructor development, patterned after models from Teach For America, Peer Instruction, and Code Club initiatives.

Community and governance

The project operates as a community of volunteer instructors, maintainers, and steering committee members drawn from academic institutions, national labs, museums, non-profit research organizations, and industry labs including Google, Microsoft Research, IBM Research, Amazon Web Services, Meta Platforms research labs, Intel Labs, NVIDIA Research, Oracle Labs, SAP Research, Siemens, Bayer, Pfizer, Novartis, and Roche. Governance includes elected and appointed roles that steward lesson repositories, instructor training, and code of conduct enforcement, similar to governance frameworks at Linux Foundation, Apache Software Foundation, and Python Software Foundation. Community support uses communication channels modeled on practices from Stack Overflow, Discourse, Slack, Mailman, and GitHub Discussions.

Impact and adoption

Adoption has spanned hundreds of workshops delivered globally, reaching learners at universities, research institutes, museums, and governmental research agencies such as United States Geological Survey, United States Environmental Protection Agency, Centers for Disease Control and Prevention, Public Health England, Health Canada, European Centre for Disease Prevention and Control, World Health Organization, and Food and Agriculture Organization. Reports and evaluations by teams at University of Washington eScience Institute, Carnegie Mellon University Human-Computer Interaction Institute, University of Edinburgh Data Science Institute, and ETH Zurich Statistical Laboratory indicate improvements in reproducible practices, data management, and researcher confidence. The initiative has influenced curriculum design in postgraduate programs at institutions such as King's College London, University of Bristol, Lancaster University, University of Southampton, Drexel University, New York University, University of California, San Diego, and Boston University, and has been referenced in policy discussions involving the National Academies of Sciences, Engineering, and Medicine and funding agencies including the European Research Council.

Category:Open educational resources