Software Heritage

Name	Software Heritage
Founded	2016
Founders	Roberto Di Cosmo; Stefano Zacchiroli
Headquarters	Paris
Type	Non-profit archive
Mission	Collect, preserve and share source code

Contents

History
Mission and Objectives
Collection and Archiving Methods
Architecture and Technology
Access and Services
Governance and Funding
Impact and Criticism


Software Heritage is an international initiative to collect, preserve and share the publicly available source code of software projects. Established by a team of computer scientists and engineers, it collaborates with major research institutions, archival initiatives and industrial partners to build a universal archive of software source code. The project operates a large-scale repository infrastructure intended to interoperate with academic repositories, corporate code hosting platforms and cultural heritage organizations.

History

The project was announced in 2016, following work by researchers affiliated with INRIA, Université Paris Diderot, École normale supérieure, Fonds National pour la Science and collaborators from Google, Microsoft Research, IBM Research and Intel. Early milestones included collaborations with GitHub, GitLab, Bitbucket and the Debian Project to ingest publicly hosted repositories. Significant events in its timeline include partnerships with UNESCO for recognition of software as cultural heritage, agreements with the French Ministry of Culture, and cooperative projects with the Internet Archive and the Software Preservation Network. Founders and early advocates such as Roberto Di Cosmo and Stefano Zacchiroli engaged with communities including the Free Software Foundation, the Apache Software Foundation, the Linux Foundation and the Mozilla Foundation to expand the archive's scope. Over time, the archive extended collaborations to universities such as the Massachusetts Institute of Technology, Harvard University and the University of Cambridge, and to national libraries such as the Bibliothèque nationale de France and the Library of Congress. Development tracked open-science initiatives, including those led by the European Commission's research programs, and partnerships with corporate actors such as Red Hat, Oracle, Facebook and Amazon Web Services.

Mission and Objectives

The stated mission centers on long-term preservation and broad accessibility, aligning with goals promoted by UNESCO and with preservation principles endorsed by groups such as the Blue Ribbon Task Force. Objectives include collecting source code from major hosting services such as GitHub, GitLab, SourceForge and Launchpad; providing a persistent identifier system analogous to the scholarly identifiers used by CrossRef and ORCID; and enabling reproducible research practices echoed by initiatives at CERN (the European Organization for Nuclear Research). The project aims to serve stakeholders including academic libraries such as Oxford University Library, national archives such as the National Archives of the United Kingdom, standards bodies such as the W3C, and open-source communities exemplified by the Debian Project and the GNOME Foundation.

Collection and Archiving Methods

Collection relies on automated harvesting from platforms such as GitHub, GitLab, Bitbucket, SourceForge and GNU Savannah, as well as on archival dumps like those produced by GitLab Archive efforts and mirrors hosted by Software Heritage Mirror partners. Ingested artifacts include commits, trees, blobs and releases, with provenance metadata linking to contributors found on GitHub, Launchpad, GitLab and federated code hosting networks. Archival workflows borrow practices from LOCKSS, CLOCKSS, and digital preservation standards used by the International Federation of Library Associations and Institutions and ISO committees. The project implements deduplication through a Merkle directed acyclic graph (DAG), in which every object is identified by a hash of its content, so that files, directories and commits shared across repositories are stored only once; the technique is comparable to those used by the Linux Kernel Archives and by distributed projects like IPFS and BitTorrent. Collaborations with package ecosystems such as PyPI, npm, Maven Central and CRAN help ensure links between source trees and released artifacts maintained by organizations including the Apache Software Foundation and the Eclipse Foundation.
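
The deduplication idea can be illustrated with a toy content-addressed store. This is a minimal sketch and not the archive's actual implementation: the ObjectStore class, the ingest helper, the use of SHA-256 and the serialization of tree nodes are all assumptions made for illustration.

```python
import hashlib


class ObjectStore:
    """Toy content-addressed store (hypothetical): objects are keyed by their hash."""

    def __init__(self):
        self.objects = {}   # hex digest -> raw bytes
        self.writes = 0     # objects actually written, after deduplication

    def put(self, payload: bytes) -> str:
        key = hashlib.sha256(payload).hexdigest()
        if key not in self.objects:      # identical payloads are stored once
            self.objects[key] = payload
            self.writes += 1
        return key

    def put_blob(self, data: bytes) -> str:
        return self.put(b"blob\0" + data)

    def put_tree(self, entries: dict) -> str:
        # A tree node lists (name, child id) pairs; sorting makes the id
        # independent of insertion order. Because child ids are hashes,
        # the tree id commits to everything beneath it (the Merkle property).
        lines = [f"{name}\0{child}" for name, child in sorted(entries.items())]
        return self.put(b"tree\0" + "\n".join(lines).encode())


def ingest(store: ObjectStore, files: dict) -> str:
    """Store a flat {filename: bytes} snapshot and return its root tree id."""
    return store.put_tree({name: store.put_blob(data) for name, data in files.items()})


store = ObjectStore()
upstream = {"README": b"hello\n", "main.c": b"int main(void) { return 0; }\n"}
fork = dict(upstream, LICENSE=b"MIT\n")   # a fork that adds one file

root_a = ingest(store, upstream)
root_b = ingest(store, fork)
root_c = ingest(store, dict(upstream))    # re-ingesting an identical snapshot

print(store.writes)       # 5: three distinct blobs plus two distinct trees
print(root_a == root_c)   # True: same contents, same identifier
```

Ingesting the fork costs only two new objects (the added file and the new root tree), and re-ingesting an unchanged snapshot costs none, which is why this structure suits an archive that harvests many forks and mirrors of the same projects.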

Architecture and Technology

The technical stack employs Merkle graph data structures inspired by concepts used in Git and by research from INRIA and École Polytechnique. Storage infrastructure uses clustered file systems and object stores comparable to CERN's OpenStack deployment and to the offerings of cloud providers such as Amazon Web Services and Google Cloud Platform, while relying on compute contributions from academic centers like CNRS and CINES. Indexing and search components integrate technologies similar to Elasticsearch and Apache Lucene, and expose APIs analogous to the GitHub and GitLab APIs. Persistent identifiers are modeled after schemes such as the DOI system and registries such as CrossRef, while provenance metadata interoperates with standards promoted by Dublin Core and W3C PROV. The architecture supports mirroring strategies inspired by the Internet Archive and the distributed replication practiced by content delivery network operators.
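
As an illustration of how a Merkle-style identifier can be derived from content alone, the sketch below computes an identifier for a single file in the style of the project's published SWHID scheme, where a content object is named swh:1:cnt: followed by the same SHA-1 digest that Git assigns to a blob. The function names are hypothetical, and only the file-content case is covered; directories, revisions and releases use other object types that are not shown here.

```python
import hashlib


def git_blob_sha1(data: bytes) -> str:
    """Hash file contents the way Git hashes a blob: sha1(b"blob <size>\\0" + data)."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def content_swhid(data: bytes) -> str:
    """Build a version-1 content identifier of the form swh:1:cnt:<hex digest>."""
    return "swh:1:cnt:" + git_blob_sha1(data)


print(content_swhid(b"hello world\n"))
# swh:1:cnt:3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```

Because the identifier is computed from the bytes themselves, anyone holding a copy of the file can recompute and verify it without consulting a registry, which is a notable difference from registry-assigned identifiers.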

Access and Services

Public access is provided through a web interface, programmatic APIs and bulk data dumps used by researchers from MIT, Stanford University and the University of California, Berkeley, and by industrial R&D teams at Google, Microsoft and IBM. Services include code search, repository browsing, source code provenance tracking, and archival references used in scholarly communication by publishers such as Springer Nature and Elsevier. Integration with development platforms like GitHub and GitLab and with CI/CD tools such as Travis CI and Jenkins enables reproducible build efforts. Educational institutions including the University of Oxford, the University of Cambridge and École normale supérieure use the archive for teaching software history and for software archaeology projects.
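
Programmatic access might look like the following sketch, which builds a request to resolve an identifier string of the form swh:1:cnt:<hex digest> into an archived object. The base URL and the /resolve/ path are assumptions about the public web API and should be checked against the archive's API documentation; the helper names are hypothetical, and the network call is defined but not executed here.

```python
import json
import urllib.request

# Assumed base URL of the public web API; verify before relying on it.
API_ROOT = "https://archive.softwareheritage.org/api/1"


def resolve_url(swhid: str) -> str:
    """Return the (assumed) API URL that resolves an identifier to an archived object."""
    return f"{API_ROOT}/resolve/{swhid}/"


def resolve(swhid: str, timeout: float = 10.0) -> dict:
    """Fetch and decode the JSON answer for an identifier; requires network access."""
    with urllib.request.urlopen(resolve_url(swhid), timeout=timeout) as response:
        return json.load(response)


url = resolve_url("swh:1:cnt:3b18e512dba79e4c8300dd08aeb37f8e728b8dad")
print(url)
```

Keeping URL construction separate from the network call makes the helper easy to test offline and to adapt if the endpoint layout differs from the one assumed above.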

Governance and Funding

Governance involves a supervisory board, advisory committees and partnerships with institutions such as INRIA, École normale supérieure, CNRS and the Collège de France, with philanthropic entities like the Alfred P. Sloan Foundation, and with national research agencies including the ANR. Funding sources combine grants from entities such as the European Commission's Horizon programs, donations from foundations like the Mozilla Foundation, corporate sponsorships from Red Hat and GitHub sponsors, and in-kind contributions from universities and supercomputing organizations such as PRACE and CINES. Legal and policy advice has been sought from organizations like Creative Commons and the Open Source Initiative, and from legal scholars associated with Harvard Law School and Stanford Law School.

Impact and Criticism

Claimed impacts include enabling reproducible research in venues like NeurIPS, ICML and ACM conferences, and facilitating code citation practices advocated by FORCE11 and by publishers including IEEE and ACM. The archive has been used in empirical software engineering studies at the University of Washington, Carnegie Mellon University and the University of Toronto, and supports digital heritage initiatives linked to UNESCO programs. Criticisms focus on legal and ethical issues raised by rights holders represented by the Software Freedom Conservancy, interoperability concerns noted by maintainers at the Debian Project, and scalability challenges discussed in forums involving the Linux Foundation and cloud providers such as Amazon Web Services. Debates with stakeholders including GitHub and GitLab have addressed takedown procedures, licensing detection, and personal data considerations under frameworks influenced by the GDPR and by national data protection authorities like the CNIL.
