Lexibank — LLMpedia

Lexibank
Name	Lexibank
Type	lexical database
Scope	comparative lexicography
Established	2017
Creator	Johann-Mattis List, Robert Forkel, Simon J. Greenhill, and colleagues
Developers	Max Planck Institute for Evolutionary Anthropology; University of Zurich
License	Creative Commons

Lexibank

Lexibank is an open lexical database project that compiles standardized wordlists from diverse language families for comparative research. It aggregates lexical data from fieldwork, corpora, and historical sources to support studies in historical linguistics, computational linguistics, and evolutionary approaches to language. The project combines data curation, reproducible workflows, and interoperable formats so that data can be reused across projects associated with institutions such as the Max Planck Institute for the Science of Human History, the Max Planck Institute for Evolutionary Anthropology, and the University of Zurich.

Overview

Lexibank organizes lexical datasets into machine-readable wordlists linked to metadata about sources and informants, facilitating cross-family comparisons among families such as the Indo-European, Sino-Tibetan, Austronesian, Niger–Congo, and Uralic languages. It employs standards from initiatives such as Cross-Linguistic Data Formats (CLDF), connects to language catalogues such as Glottolog and Ethnologue, and complements phylogenetic projects exemplified by work at the Santa Fe Institute and the Institute for Advanced Study. Contributors include researchers affiliated with the Max Planck Society, the University of Cambridge, and the University of Oxford.

Conceived in the context of reproducible data science, the project gained momentum following collaborations among scholars at the Max Planck Institute for Evolutionary Anthropology, the University of Zurich, and the University of Edinburgh. Early development drew on software ecosystems promoted by groups at the Center for Open Science and the Open Science Framework, with methodological influence from the Human Relations Area Files community and the World Atlas of Language Structures. Key contributors and coordinators have collaborated with researchers from Harvard University, Stanford University, and the University of California, Berkeley on phylogenetic and typological synthesis. The project has grown through integration with repositories and platforms such as GitHub and through partnerships with initiatives at the Natural History Museum, London and the Smithsonian Institution.

Datasets in the project cover lexical items across semantic domains, including the core vocabulary used in comparative work. Entries are annotated with phonetic forms, glosses, and etymological notes drawn from sources such as field notebooks from researchers at the School of Oriental and African Studies, archives at the British Library, and corpora curated at the Linguistic Data Consortium. Data are organized using standards aligned with Cross-Linguistic Data Formats and linked to language identifiers from Glottolog and to authority records from institutions like the Library of Congress and the Deutsche Nationalbibliothek. Lexical entries often reference earlier comparative datasets such as the Comparative Indo-European Database and genealogical classifications employed by specialists at the Max Planck Institute for the Science of Human History.
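
To make the linkage between forms and Glottolog identifiers concrete, the following minimal sketch joins a forms table to a languages table. It assumes the default CLDF Wordlist column names (ID, Language_ID, Parameter_ID, Form, Segments, Glottocode); the sample rows and the helper names read_rows and forms_by_concept are illustrative assumptions, not content or code from an actual Lexibank dataset.

```python
import csv
import io

# Illustrative sample rows only; not taken from an actual Lexibank dataset.
LANGUAGES_CSV = """ID,Name,Glottocode
eng,English,stan1293
deu,German,stan1295
"""

FORMS_CSV = """ID,Language_ID,Parameter_ID,Form,Segments
eng-hand-1,eng,hand,hand,h æ n d
deu-hand-1,deu,hand,Hand,h a n t
eng-water-1,eng,water,water,w ɔː t ə
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def forms_by_concept(forms, languages):
    """Group forms by concept, replacing local language IDs with Glottocodes."""
    glottocodes = {row["ID"]: row["Glottocode"] for row in languages}
    grouped = {}
    for row in forms:
        entry = (
            glottocodes[row["Language_ID"]],
            row["Form"],
            row["Segments"].split(),  # segments are space-separated
        )
        grouped.setdefault(row["Parameter_ID"], []).append(entry)
    return grouped


wordlist = forms_by_concept(read_rows(FORMS_CSV), read_rows(LANGUAGES_CSV))
for concept, entries in wordlist.items():
    print(concept, entries)
```

Keeping language metadata in its own table means that each form row carries only a short local identifier, while the Glottocode is stored once per language.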

Lexibank applies reproducible pipelines using tools and practices endorsed by the ReproZip community and computational workflows influenced by projects at the Allen Institute for AI and the European Research Council. Standardization relies on phonetic transcription practices based on the conventions of the International Phonetic Association and on metadata schemas interoperable with the Open Archives Initiative protocols. Quality control integrates peer review models similar to those at the Royal Society and curatorial procedures used by the Smithsonian Institution archives. The project interfaces with phylogenetic methods popularized by researchers at the Max Planck Institute for Evolutionary Anthropology and computational modelers at the Santa Fe Institute.
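
One way such transcription standardization can be approached is a lookup table that maps source graphemes to IPA segments, applied with greedy longest-match segmentation. The sketch below illustrates that general technique only; the PROFILE table, the segment function, and the "?" marker for unmapped characters are hypothetical and are not taken from Lexibank's own tooling.

```python
# Hypothetical grapheme-to-IPA table for illustration; real profiles are
# written per dataset and are far more detailed.
PROFILE = {
    "sch": "ʃ",
    "ch": "x",
    "a": "a",
    "u": "u",
    "l": "l",
    "e": "ə",
    "b": "b",
}


def segment(word, profile):
    """Convert an orthographic word to segments by greedy longest match."""
    max_len = max(len(grapheme) for grapheme in profile)
    word = word.lower()
    segments = []
    i = 0
    while i < len(word):
        # Try the longest candidate grapheme first, then shorter ones.
        for size in range(min(max_len, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in profile:
                segments.append(profile[chunk])
                i += size
                break
        else:
            # No grapheme matched: flag the character for manual review.
            segments.append("?")
            i += 1
    return segments


print(segment("Schule", PROFILE))
print(segment("Buch", PROFILE))
```

Flagging unmapped characters instead of dropping them keeps gaps in the table visible during curation.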

The project provides programmatic access through repositories compatible with the GitHub workflow, citation guidance aligned with Digital Object Identifier practices, and downloadable datasets packaged for use with tools such as cldfbench, LingPy, and related Python toolchains. Visualization and analysis pipelines have been demonstrated in workshops at the Linguistic Society of America and training events hosted by the European Language Resources Association. Interoperability enables integration with digital infrastructures at the Max Planck Digital Library and archival systems used by the British Library and the Bibliothèque nationale de France.
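
As a rough illustration of how a packaged dataset can be inspected programmatically, the sketch below reads a CLDF-style JSON metadata description and finds which file holds each table. The embedded metadata is a trimmed, hypothetical example in the shape CLDF uses (a "tables" list whose entries carry "url" and "dc:conformsTo"), and table_files is an illustrative helper. In practice a dedicated library such as pycldf would normally handle this step.

```python
import json

# Trimmed, hypothetical metadata in the general shape of a CLDF description.
METADATA_JSON = """
{
  "dc:conformsTo": "http://cldf.clld.org/v1.0/terms.rdf#Wordlist",
  "tables": [
    {"url": "forms.csv",
     "dc:conformsTo": "http://cldf.clld.org/v1.0/terms.rdf#FormTable"},
    {"url": "languages.csv",
     "dc:conformsTo": "http://cldf.clld.org/v1.0/terms.rdf#LanguageTable"},
    {"url": "parameters.csv",
     "dc:conformsTo": "http://cldf.clld.org/v1.0/terms.rdf#ParameterTable"}
  ]
}
"""


def table_files(metadata):
    """Map each declared component name (e.g. 'FormTable') to its file."""
    files = {}
    for table in metadata.get("tables", []):
        # The component name is the fragment after '#' in the term URI.
        component = table.get("dc:conformsTo", "").rsplit("#", 1)[-1]
        if component:
            files[component] = table["url"]
    return files


files = table_files(json.loads(METADATA_JSON))
print(files)
```

Resolving file names through the metadata avoids hard-coding them, so the same script can run against any dataset that follows this layout.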

Researchers use the datasets for comparative reconstruction in studies pursued by teams at Harvard University, the University of Cambridge, and the Max Planck Institute for the Science of Human History. Topics include lexical diffusion, contact phenomena in regions studied by scholars at the School of Oriental and African Studies and the Australian National University, and macroevolutionary modeling practiced at the Santa Fe Institute. Lexibank-informed research has contributed to publications in journals associated with the Linguistic Society of America, in the Proceedings of the National Academy of Sciences of the United States of America, and in books published by Cambridge University Press. The infrastructure supports educators and curators at institutions such as the Smithsonian Institution and the Natural History Museum, London in making lexical data available for interdisciplinary research.

Category:Linguistics databases