Language Bank of Norway

Name	Language Bank of Norway
Established	2007
Location	University of Oslo, Norway
Type	Research infrastructure

The Language Bank of Norway is a national research infrastructure, based at the University of Oslo, that supplies annotated language data and tools for computational linguistics and digital humanities. It offers curated corpora, lexica, and annotations to support projects in natural language processing, speech technology, and language documentation involving Norwegian Bokmål, Norwegian Nynorsk, and minority and immigrant languages. The service works with academic, commercial, and public institutions, including archives, libraries, and research councils.

Overview

The Language Bank of Norway aggregates datasets from partners such as the National Library of Norway, the Norwegian School of Economics, and the Norwegian University of Science and Technology, as well as major media outlets such as Aftenposten, NRK, and VG. It collaborates with international organizations including CLARIN, ELRA, and UTC, and with research projects funded by the Research Council of Norway and the European Commission. Users range from teams at the University of Cambridge, Stanford University, the Massachusetts Institute of Technology, Saarland University, and the University of Helsinki to companies such as Google, Microsoft, Apple, and Amazon, and startups at Oslo Innovation Center.

History

The Language Bank's origins trace to early corpus initiatives at the University of Bergen and to computational linguistics groups at UiT The Arctic University of Norway (formerly the University of Tromsø). Early collections built on work by researchers such as Arne Torp and by institutions such as the Norwegian Broadcasting Corporation (NRK) and the National Library of Norway. Major milestones include integration with the CLARIN ERIC framework, grants from the Research Council of Norway, and partnerships with projects such as the Swedish Språkbanken, NoSketchEngine, Copernicus, and the Tromsø CLARIN Center. Strategic expansions followed developments presented at conferences such as ACL, EACL, LREC, and NAACL.

Organization and Governance

Governance involves stakeholders including the University of Oslo, the Norwegian Ministry of Culture, and the Norwegian Directorate for Higher Education and Skills. Advisory boards include representatives from the University of Stavanger, BI Norwegian Business School, Norsk Regnesentral, and the Norwegian Language Council. Operational units coordinate with archives such as the National Archives of Norway and museums such as the Norwegian Museum of Cultural History. Ethical oversight involves the Norwegian Data Protection Authority (Datatilsynet) and research ethics committees at Oslo Metropolitan University and the University of Bergen.

Collections and Resources

Collections encompass written corpora from publishers such as Gyldendal Norsk Forlag, Aschehoug, and Cappelen Damm; speech corpora from Telenor and the broadcast archives at NRK; and specialized datasets from projects at SINTEF and NORCE. Lexical resources include entries aligned with WordNet and Wiktionary, along with multilingual alignments involving Europarl, OpenSubtitles, and Common Crawl. Annotated corpora follow TEI, LAF, and ISO standards, and include treebanks and dependency annotations compatible with datasets produced by the University of Oslo Computational Linguistics Group, Uppsala University, and the University of Stuttgart.

Access and Services

Access policies balance the rights held by publishers such as Schibsted and by public institutions such as the National Library of Norway against research use agreements similar to the frameworks used by the British Library and the Bibliothèque nationale de France. Services include secure access portals modeled on CLARIN services, metadata harvesting compatible with OAI-PMH, and APIs used by projects at Graphcore and NVIDIA. Training and outreach are conducted in cooperation with centers such as the Language Technology Group at UiO and with summer schools associated with LREC and EMNLP.
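
As an illustration of what OAI-PMH-compatible metadata harvesting involves, the following Python sketch builds a `ListRecords` request URL and extracts identifiers, titles, and the resumption token from a response. It is a minimal sketch and not the Language Bank's own client: the endpoint `https://example.org/oai`, the sample record, and the helper names `list_records_url` and `parse_list_records` are assumptions made for this example. Only the protocol elements (the `ListRecords` verb, the `oai_dc` metadata prefix, and `resumptionToken` paging) come from the OAI-PMH 2.0 specification.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces defined by OAI-PMH 2.0 and Dublin Core.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL (hypothetical helper)."""
    if resumption_token:
        # A resumption token is an exclusive argument: send it without metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (records, next_token) from a ListRecords response body."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iter(OAI_NS + "record"):
        identifier = record.findtext(f"{OAI_NS}header/{OAI_NS}identifier")
        # Take the first dc:title, if the record carries one.
        title = next((el.text for el in record.iter(DC_NS + "title")), None)
        records.append({"identifier": identifier, "title": title})
    token_el = root.find(f".//{OAI_NS}resumptionToken")
    # An empty or missing token means the list is complete.
    token = token_el.text if token_el is not None and token_el.text else None
    return records, token


# Invented sample response, used only to exercise the parser.
SAMPLE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:corpus-1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example newspaper corpus</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>
"""

if __name__ == "__main__":
    records, token = parse_list_records(SAMPLE)
    print(records, token)
    print(list_records_url("https://example.org/oai", resumption_token=token))
```

A harvester would repeat the request with each returned token until the response no longer carries one.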

Research and Applications

Research enabled by the Language Bank supports work in areas advanced at venues such as ACL, EMNLP, COLING, and ISCA. Applications include machine translation (informed by Google Translate and DeepL), speech recognition akin to systems built with the Kaldi toolkit and Mozilla Common Voice data, information retrieval comparable to Elasticsearch and Lucene, and language preservation efforts parallel to those of UNESCO. Collaborative projects have involved partners such as Facebook AI Research, DeepMind, and Apple Speech, as well as academic labs at the University of Edinburgh and ETH Zurich.

Legal and Ethical Issues

Relevant legal frameworks include legislation such as the Norwegian Personal Data Act, rights held by publishers such as Schibsted Media Group, and obligations under European Union directives affecting cross-border data use. Ethical issues intersect with standards from the Association for Computational Linguistics, with privacy authorities such as the Norwegian Data Protection Authority (Datatilsynet), and with archival release policies used by institutions such as the National Archives of Norway and the National Library of Norway. Debates include licensing models similar to Creative Commons, data minimization principles advocated by the Council of Europe, and consent practices researched at the University of Bergen and UiT.