| OpenWebText | |
|---|---|
| Name | OpenWebText |
| Introduced | 2019 |
| Creators | Aaron Gokaslan and Vanya Cohen, with open-source contributors |
| Type | Web-scraped text dataset |
| Domain | Multidomain English web content |
| License | Various web licenses (see Licensing and Ethical Considerations) |
| Usage | Pretraining language models, NLP research |
OpenWebText is a large-scale corpus of English web text assembled as an open replication of WebText, the proprietary corpus OpenAI used to train the GPT-2 family of models described by Radford et al. Because WebText itself was never released, OpenWebText was intended to give the research community an accessible stand-in for model pretraining, evaluation, and comparative study.
OpenWebText was created to approximate the distribution of web text behind GPT-2 by reconstructing WebText's collection recipe from its published description. The dataset aggregates public English-language content scraped from URLs shared in Reddit submissions, and the collection and processing code is hosted on GitHub so that the corpus can be regenerated or extended. Open-source groups such as EleutherAI and Hugging Face have since redistributed processed copies of the corpus, and it has served as a reference pretraining corpus in comparative studies across academic and industrial labs.
Material was sourced from web pages whose URLs were submitted to Reddit and received at least three karma, mirroring the popularity filter described for WebText in the GPT-2 paper. The pipeline fetches those pages, extracts body text with open-source HTML-to-text tooling, filters out non-English documents, and removes near-duplicates. Source diversity therefore reflects whatever Reddit users chose to link to, spanning major news outlets such as The New York Times, The Guardian, and BBC News, magazines such as Wired, preprint servers such as arXiv, reference sites such as Wikipedia, and independent blogs and forums hosted on platforms like WordPress, Substack, and Tumblr.
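A minimal sketch of this kind of collection step is given below, assuming a tab-separated file of submitted URLs and karma scores and using the newspaper3k extractor; the file name, its format, and the choice of extractor are illustrative assumptions, not a description of the actual OpenWebText code.

```python
# Illustrative sketch of an OpenWebText-style collection step (not the
# original pipeline): keep Reddit-submitted URLs with >= 3 karma, fetch each
# page, and extract its body text. "submissions.tsv" and its layout are
# assumptions made for this example.
import csv

from newspaper import Article  # newspaper3k, a common HTML-to-text extractor

MIN_KARMA = 3  # popularity threshold described in the GPT-2 paper

def load_candidate_urls(path: str) -> list[str]:
    """Read (url, karma) rows and keep URLs that pass the popularity filter."""
    urls = []
    with open(path, newline="", encoding="utf-8") as f:
        for url, karma in csv.reader(f, delimiter="\t"):
            if int(karma) >= MIN_KARMA:
                urls.append(url)
    return urls

def extract_text(url: str) -> str:
    """Download one page and return its extracted body text."""
    article = Article(url)
    article.download()
    article.parse()
    return article.text

if __name__ == "__main__":
    for url in load_candidate_urls("submissions.tsv")[:5]:
        print(url, len(extract_text(url)), "characters")
```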
OpenWebText is distributed as deduplicated plaintext documents, one per extracted web page, and is typically tokenized with the byte-pair-encoding (BPE) scheme used by GPT-2 and supported by implementations in Hugging Face Transformers, Fairseq, and the TensorFlow and PyTorch ecosystems. The content mixes journalistic writing from outlets such as The Washington Post and Bloomberg, scientific and technical writing from journals, preprint servers such as arXiv, and documentation portals, creative writing from community publishing platforms, and discussion threads from forums such as Stack Overflow, giving the corpus a broad spread of genres and registers.
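As a small illustration of that compatibility, the sketch below runs one document through the GPT-2 BPE tokenizer from Hugging Face Transformers; the sample string is invented for the example.

```python
# Minimal sketch: an OpenWebText document is plain text, so preparing it for a
# GPT-2-style model is just a pass through the GPT-2 byte-pair-encoding tokenizer.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

document = "OpenWebText is a corpus of English web text gathered from Reddit-shared links."
token_ids = tokenizer(document)["input_ids"]

print(len(token_ids), "BPE tokens")
print(tokenizer.convert_ids_to_tokens(token_ids)[:10])  # first few subword pieces
```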
Researchers have used OpenWebText to pretrain transformer language models comparable to the GPT-2 architecture introduced by Radford et al., usually through open-source frameworks such as Hugging Face Transformers, Fairseq, and OpenNMT. It also serves as a baseline corpus in ablation and data-quality studies, as a source for fine-tuning and continued pretraining in downstream projects, and as a reference dataset in benchmarks and shared tasks reported at venues such as ACL, EMNLP, NeurIPS, ICLR, and NAACL.
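A hedged sketch of loading a community-hosted copy of the corpus for such a workflow follows; the Hugging Face Hub identifier "Skylion007/openwebtext" is an assumption about where a mirror lives, and streaming is used only to avoid downloading the full corpus up front.

```python
# Sketch of pretraining-style preprocessing over a hosted OpenWebText mirror.
# The dataset identifier below is an assumption; substitute the mirror you use.
from datasets import load_dataset
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# Stream the corpus instead of downloading all of it at once.
dataset = load_dataset("Skylion007/openwebtext", split="train", streaming=True)

for example in dataset.take(3):
    ids = tokenizer(example["text"], truncation=True, max_length=1024)["input_ids"]
    print(len(ids), "tokens in this document")
```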
Models pretrained on OpenWebText are commonly evaluated with held-out perplexity and downstream task performance, with results reported at venues including NeurIPS and ICLR. Documented limitations include a topical bias toward sources popular with Reddit's largely English-speaking user base, underrepresentation of regional outlets such as Folha de S.Paulo or Le Monde even where English-language content exists, and the difficulty of exactly reproducing OpenAI's undocumented curation of the original WebText. Workshop and survey analyses have also raised concerns about residual duplication, temporal drift as the underlying links age, and coverage differences relative to corpora such as C4 (Colossal Clean Crawled Corpus) built from Common Crawl.
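The sketch below shows the basic perplexity computation behind such evaluations, applied to the public GPT-2 checkpoint and an invented sample sentence rather than a held-out OpenWebText split.

```python
# Sketch of a perplexity check of the kind reported for models pretrained on
# OpenWebText; real evaluations average the loss over a held-out corpus.
import math

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = "OpenWebText was assembled from links shared and upvoted on Reddit."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # With labels supplied, the model returns the mean cross-entropy loss.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print("perplexity:", math.exp(loss.item()))
```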
Licensing of the content within OpenWebText is heterogeneous: the corpus mixes permissively licensed material with copyrighted journalistic and creative works from publishers such as The New Yorker and The Atlantic, and redistribution of the assembled corpus does not change the rights attached to the underlying pages. Ethical discussion has centered on consent, fair use, and privacy for text scraped from forums, comment sections, and social platforms such as Reddit, Twitter, and Stack Exchange, as well as on how web-scale scraping interacts with data-protection guidance from bodies such as the European Commission and the UK Information Commissioner's Office. These debates have shaped the data-handling practices of academic and industry labs that redistribute or train on the corpus.
Category:Datasets