The Rosetta Project

Name	The Rosetta Project
Founded	2000
Founder	SIL International
Type	Linguistic archive
Location	Austin, Texas

Contents

Overview
History and Development
Objectives and Scope
Methods and Technologies
Collections and Content
Collaborations and Impact

The Rosetta Project is a large-scale initiative to build a publicly accessible archive of human languages and to preserve linguistic diversity through digital and physical media. It was initiated by SIL International and works with universities, museums, indigenous organizations, and libraries to collect, document, and disseminate lexical, grammatical, and phonological data. The project combines the fieldwork traditions associated with Edward Sapir and Franz Boas, and the formal linguistic theory associated with Noam Chomsky, with contemporary computational methods developed at institutions such as MIT, Stanford University, and the University of Texas at Austin.

Overview

The project curates multilayered language records, including wordlists, grammars, texts, and audio, from communities represented in archives such as the Library of Congress, the Smithsonian Institution, and the British Library. It aims to make these materials discoverable through standards advanced by the International Organization for Standardization (ISO), the Unicode Consortium, and the World Wide Web Consortium (W3C), while aligning with ethical frameworks promoted by the United Nations Educational, Scientific and Cultural Organization (UNESCO) and the World Intellectual Property Organization (WIPO). Its public dissemination channels follow strategies used by the Internet Archive, the Wikimedia Foundation, and the Human Genome Project to achieve broad accessibility.

History and Development

Conceived at the turn of the 21st century, the initiative grew out of collaborations among field linguists affiliated with SIL International, the Max Planck Institute for Psycholinguistics, and the School of Oriental and African Studies. Early development involved pilot corpora modeled on the collections of Edward Sapir and Franz Boas and drew on digital preservation techniques pioneered at Oak Ridge National Laboratory and Los Alamos National Laboratory. Major milestones included the integration of metadata schemas influenced by Dublin Core and the adoption of corpus architectures similar to those of projects at Oxford University and Harvard University. Funding and support have come from philanthropic bodies associated with the Gordon and Betty Moore Foundation and from research grants comparable to awards from the National Science Foundation.

Objectives and Scope

The primary objective is to create a comprehensive, machine-readable index of human linguistic diversity that supports revitalization efforts in communities connected to institutions such as the First Nations University of Canada, the University of Auckland, and the Universidad Nacional Autónoma de México. The scope covers endangered languages studied by scholars linked to the Linguistic Society of America, the Association for Computational Linguistics, and the Society for the Study of the Indigenous Languages of the Americas, and it extends to typological coverage comparable to that of resources such as Ethnologue and Glottolog. The project seeks interoperability with catalogues such as the UNESCO Atlas of the World's Languages in Danger and with national archives, including that of the National Museum of the American Indian.

Methods and Technologies

The project employs field methods rooted in the traditions of Boasian anthropology and analytical frameworks inspired by figures such as Leonard Bloomfield and William Labov, while drawing on computational pipelines from Google Research, Microsoft Research, and open-source communities hosted on GitHub. Its technologies include high-fidelity digital audio recording hardware of the kind used by teams at the British Library Sound Archive, annotation tools influenced by ELAN, and database systems compatible with the XML, JSON-LD, and RDF stacks promoted by the W3C. Encoding follows ISO 639 for language codes, the Unicode Standard for orthographic representation, and the data citation norms advanced by Digital Object Identifier (DOI) registration agencies.
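The encoding conventions named above can be illustrated with a minimal Python sketch. It is not the project's actual schema or pipeline: the function name `make_record`, the `ex:` vocabulary under `https://example.org/`, and the sample entry are all assumptions made for illustration. It shows a wordlist record tagged with a three-letter ISO 639-3 code, text normalized to Unicode NFC, and serialization as a JSON-LD-style document that borrows Dublin Core terms.

```python
import json
import unicodedata

# Dublin Core terms namespace (real); the "ex" vocabulary is hypothetical.
CONTEXT = {
    "dc": "http://purl.org/dc/terms/",
    "ex": "https://example.org/vocab#",
}


def make_record(iso_code, language_name, entries):
    """Build a JSON-LD-style wordlist record (illustrative schema only).

    iso_code      -- three-letter lowercase ISO 639-3 identifier, e.g. "haw"
    language_name -- human-readable language name
    entries       -- iterable of (form, gloss) pairs
    """
    # ISO 639-3 identifiers are three lowercase ASCII letters.
    # This checks the shape only, not whether the code is actually assigned.
    if not (len(iso_code) == 3 and iso_code.isascii()
            and iso_code.isalpha() and iso_code.islower()):
        raise ValueError(f"not a well-formed ISO 639-3 code: {iso_code!r}")

    return {
        "@context": CONTEXT,
        "@type": "ex:Wordlist",
        "dc:language": iso_code,
        "dc:title": f"{language_name} wordlist",
        "ex:entries": [
            {
                # NFC composes base letters and combining marks so that
                # visually identical strings compare equal.
                "ex:form": unicodedata.normalize("NFC", form),
                "ex:gloss": gloss,
            }
            for form, gloss in entries
        ],
    }


# Sample input: the Hawaiian word for "language", written with a
# decomposed macron (o + U+0304) to show the effect of normalization.
record = make_record("haw", "Hawaiian", [("\u02bbo\u0304lelo", "language")])
serialized = json.dumps(record, ensure_ascii=False, indent=2)
print(serialized)
```

Normalizing at ingestion time means a search for a composed form also matches records that were typed with combining marks, which matters for orthographies that rely heavily on diacritics.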

Collections and Content

Collections encompass bilingual wordlists, morphological paradigms, narrative texts, and audiovisual corpora contributed by scholars associated with the University of California, Berkeley, the University of Hawaiʻi, and Yale University, as well as by community researchers from Inuit Tapiriit Kanatami, the Ainu Association of Hokkaido, and the Asociación de Comunidades Indígenas. Representative content types mirror archival holdings at the Bibliothèque nationale de France and thematic collections such as those curated by the National Anthropological Archives. The physical artifact component echoes the resilience strategies of the Long Now Foundation through durable media concepts, while digital distribution channels align with repositories such as Figshare and Zenodo.

Collaborations and Impact

Collaborative networks include partnerships with academic centers such as the Max Planck Institute for Evolutionary Anthropology, the University of Melbourne, and the Australian National University, with NGOs such as Cultural Survival, and with governmental bodies akin to the National Endowment for the Humanities. The project's impact is evident in language revitalization programs in regions served by the Alaska Native Language Center, in educational curricula influenced by the Smithsonian National Museum of the American Indian, and in computational research that uses corpora comparable to those at the Linguistic Data Consortium (LDC). Its resources inform policy dialogues at UNESCO and contribute materials to comparative typological studies referenced by scholars at Princeton University and the University of Chicago.
