BigScience was a year-long international research workshop and open collaboration focused on transparency, governance, and multilingual datasets for large language models. Initiated by a coalition of academic labs, technology companies, and nonprofit organizations, it sought to produce openly available models, datasets, and evaluation tools while examining the ethical, legal, and social implications of large-scale machine learning. The project connected researchers across institutions and disciplines to coordinate dataset curation, model training, and governance.
BigScience emerged from conversations among researchers affiliated with Hugging Face, the École normale supérieure, Mila, Inria, and CNRS, with input from contributors at Google Research, Meta Platforms, Microsoft Research, and Amazon Web Services. The initiative cited precedents such as OpenAI, DeepMind, and EleutherAI, and models like GPT-3 and BERT, to motivate open alternatives. Goals included creating multilingual resources comparable to Common Crawl, Wikipedia, and BookCorpus while aligning with frameworks from UNESCO, the OECD, the European Commission, and the ACM. Organizers cited events such as the NeurIPS debates and ICLR community calls as contextual drivers for transparent machine-learning practices.
Governance drew on multi-stakeholder oversight models used by The Alan Turing Institute, the Max Planck Society, the Wellcome Trust, and the OpenAI Startup Fund. A steering committee included members from the Collège de France, the University of Montreal, Carnegie Mellon University, the University of Cambridge, and ETH Zurich. Legal counsel drew on statutes and precedents including the General Data Protection Regulation and the Berne Convention, rulings of the European Court of Justice, and guidance from national agencies such as CNIL and the Federal Trade Commission. Ethics advisory contributions came from scholars at Harvard University, Stanford University, the University of Oxford, and the Massachusetts Institute of Technology.
Research teams spanned labs including Mila, the Laboratoire d'Informatique de Paris 6 (LIP6), the AI Now Institute, Berkeley Artificial Intelligence Research (BAIR), the Stanford Artificial Intelligence Laboratory, and Princeton University. Methodological choices echoed the Transformer architecture, attention mechanisms, and the Adam optimizer, with training regimes similar to those used for RoBERTa, T5, and GPT-2. Evaluation protocols referenced benchmarks such as GLUE, SuperGLUE, and XTREME, and datasets such as SQuAD, XNLI, and Common Voice. Safety and alignment discussions invoked work from the Future of Life Institute, the Partnership on AI, and scholars publishing at AAAI, ACL, and NeurIPS.
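As an illustration of the techniques named above, a single-head scaled dot-product attention layer optimized with Adam can be sketched in PyTorch as follows; the dimensions, module, and loss are illustrative placeholders, not BigScience's actual training configuration.

```python
import math

import torch
import torch.nn as nn


class ScaledDotProductAttention(nn.Module):
    """Single-head attention: softmax(Q K^T / sqrt(d)) V."""

    def __init__(self, d_model: int):
        super().__init__()
        # Learned projections for queries, keys, and values.
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        return torch.softmax(scores, dim=-1) @ v


# Toy batch: 2 sequences of 8 tokens with 16-dimensional embeddings.
layer = ScaledDotProductAttention(d_model=16)
x = torch.randn(2, 8, 16)
out = layer(x)
print(out.shape)  # torch.Size([2, 8, 16])

# One Adam update against a dummy reconstruction loss.
opt = torch.optim.Adam(layer.parameters(), lr=1e-3)
opt.zero_grad()
loss = ((layer(x) - x) ** 2).mean()
loss.backward()
opt.step()
```

The softmax-weighted sum is the core of the attention mechanism; production Transformer stacks add multiple heads, residual connections, and layer normalization around this primitive.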
Model development targeted multilingual coverage inspired by corpora including Common Crawl, OpenSubtitles, Europarl, and OSCAR. Datasets incorporated sources such as Wikipedia, Project Gutenberg, Wiktionary, and Wikidata, along with national collections from the Bibliothèque nationale de France, the Library of Congress, the Deutsche Nationalbibliothek, and the Biblioteca Nacional de España; this curation effort culminated in the ROOTS corpus, used to train BLOOM, the project's openly licensed 176-billion-parameter multilingual language model. Architectures and baselines were compared against GPT-2, GPT-3, GPT-Neo, GPT-J, BERT, RoBERTa, XLM-R, and mBERT. Licensing and provenance discussions referenced Creative Commons, Open Data Commons, the European Data Protection Board, and precedents set by OpenStreetMap and the Wikimedia Foundation.
Compute and storage coordination involved partners such as NVIDIA, Google Cloud Platform, Amazon Web Services, Microsoft Azure, the CERN IT department, and Compute Canada. Reproducibility practices paralleled initiatives such as The Turing Way, the Open Science Framework, and Zenodo. Tooling relied on software ecosystems including PyTorch, TensorFlow, Hugging Face Transformers, Ray, Hugging Face Accelerate, Weights & Biases, and Kubeflow. Containerization and orchestration relied on Docker, Kubernetes, and cluster-management practices used at facilities such as the European Grid Infrastructure and XSEDE.
Community governance and contributor coordination mirrored structures from Wikipedia, the Mozilla Foundation, Creative Commons, and the Apache Software Foundation. Outreach included workshops at ACL, EMNLP, NeurIPS, ICLR, and COLT, and festivals such as Reinventing AI. Engagement with policymakers and civil society involved UNESCO, the Council of Europe, Amnesty International, the Electronic Frontier Foundation, and Access Now. Training materials and tutorials were disseminated through channels such as Coursera, edX, and YouTube, and through continuing-education programs at Columbia University and the University of Toronto.
Responses from academia, industry, and advocacy groups echoed debates over the GPT-3 release and policy assessments in European Commission white papers. Coverage appeared in outlets including Nature, Science, The New York Times, Wired, and MIT Technology Review. Critical perspectives referenced analyses from AlgorithmWatch and the AI Now Institute, and from scholars publishing in the Journal of Artificial Intelligence Research and the Harvard Law Review. The project influenced subsequent open-model efforts by groups such as EleutherAI and LAION, and spurred discussion at panels hosted by the World Economic Forum and the OECD.
Category:Artificial intelligence projects