WikiText-103 — LLMpedia

WikiText-103
Name	WikiText-103
Released	2016
Creator	Salesforce Research
Type	Dataset
Domain	Wikipedia
Size	about 103 million tokens
Languages	English

WikiText-103 is a large-scale English text corpus created for training and evaluating autoregressive language models. It was released by Salesforce Research to support experiments in natural language processing, especially in contexts where long-range dependencies and coherent document-level modeling are required. The corpus draws exclusively from high-quality Wikipedia articles selected to provide diverse topics spanning history, science, culture, and biography.

Overview

The corpus was assembled to address deficiencies in earlier language modeling benchmarks such as the Penn Treebank, which is small and heavily normalized, and the One Billion Word Benchmark, whose sentences are shuffled, by offering longer contiguous text from encyclopedic entries. It preserves the article-level continuity found in entries about figures like Albert Einstein, entities such as Mount Everest, institutions like Harvard University, and events such as the Battle of Waterloo, enabling evaluation of models on discourse phenomena that span many sentences. Researchers at Salesforce Research aimed for a dataset that complements corpora used by teams at Google Research, OpenAI, Facebook AI Research, and academic groups at Stanford University, the Massachusetts Institute of Technology, and the University of Oxford.

Composition and Data Collection

WikiText-103 contains roughly 103 million tokens extracted from more than 28,000 Wikipedia articles chosen for quality and length. The selection process drew on Featured and Good articles as identified by Wikipedia's editorial standards, a pool that covers entries on people like Isaac Newton, works like The Odyssey, places like Paris, and institutions like the United Nations. The dataset excludes non-article pages such as templates, disambiguation pages, and talk pages. The collection scripts follow extraction practices similar to those used for other web-derived corpora, including Common Crawl derivatives and datasets curated at the University of California, Berkeley and Carnegie Mellon University.

Preprocessing and Tokenization

Preprocessing removed markup artifacts while preserving the sentence and paragraph boundaries of the source articles, for example in descriptions of World War II campaigns or the entry on Charles Darwin. Boilerplate removal dropped elements such as infobox templates (as on List of Presidents of the United States) and navigational templates (such as those linking Nobel Prize laureates). The released text is already word-tokenized, keeps case and punctuation, and replaces rare words with an <unk> token. Tokenization schemes in downstream work vary: many teams use byte-pair encoding or related subword methods through toolkits such as Google's SentencePiece or the open-source libraries from Hugging Face, while others apply WordPiece tokenization popularized by models developed at Google Research. Typical pipelines keep the capitalization of proper nouns like Marie Curie, Amazon River, New York City, and Silicon Valley to preserve linguistic signals used by models trained at institutions such as the University of Cambridge and ETH Zurich.
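
As a rough illustration of what this preprocessing leaves behind, the sketch below splits WikiText-style text into articles and reverses the tokenizer's escape tokens. It assumes the layout of the public release, where top-level headings appear as " = Title = " and hyphens, thousands separators, and decimal points inside numbers are written as " @-@ ", " @,@ ", and " @.@ ". The function names and the sample text are made up for the example.

```python
import re

# Assumption: top-level headings look like " = Title = ", while deeper
# sections use " = = Section = = " and so on.
TOP_HEADING = re.compile(r"^= (?!=)(.+?)(?<!=) =$")

# Assumption: these escape tokens stand in for in-word punctuation.
ESCAPES = {" @-@ ": "-", " @,@ ": ",", " @.@ ": "."}


def split_articles(text):
    """Group lines under the most recent top-level heading."""
    articles = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank separator lines
        match = TOP_HEADING.match(line)
        if match:
            articles.append((match.group(1), []))
        elif articles:
            articles[-1][1].append(line)
    return articles


def undo_escapes(line):
    """Turn ' @-@ ', ' @,@ ' and ' @.@ ' back into plain punctuation."""
    for token, char in ESCAPES.items():
        line = line.replace(token, char)
    return line


# Made-up sample in the assumed format (not copied from the corpus).
SAMPLE = (
    " = Mount Everest = \n"
    " \n"
    " Mount Everest rises 8 @,@ 848 m above sea level . \n"
    " = = Climbing = = \n"
    " The first well @-@ documented ascent was in 1953 . \n"
    " = Paris = \n"
    " Paris is the capital of France . \n"
)

for title, lines in split_articles(SAMPLE):
    print(title, "->", [undo_escapes(l) for l in lines])
```

Keeping subsection headings inside the article body, as this sketch does, preserves the document order that makes the corpus useful for long-range modeling.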

Usage in Language Modeling Research

Researchers use the corpus to train transformer and recurrent models developed by groups at OpenAI, DeepMind, Microsoft Research, and university labs including Princeton University and the University of California, Berkeley. Studies compare architectures such as the Transformer, the LSTM, and attention-augmented recurrent networks on next-token prediction, and some go on to fine-tune the resulting models on downstream benchmarks like GLUE and SQuAD. The dataset has been used to study techniques introduced in high-profile work from teams at Google Brain and Facebook AI Research, and to evaluate scaling laws discussed by authors affiliated with OpenAI and MIT. Work using this corpus often cites baselines from papers published in the ACL, NeurIPS, and ICML proceedings.
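
Next-token prediction on a corpus like this is usually set up by cutting one contiguous token stream into fixed-length windows whose targets are the inputs shifted by one position. The sketch below shows that setup in plain Python. The function name, the window length, and the sample sentence are illustrative assumptions, not taken from any particular paper or toolkit.

```python
def next_token_windows(tokens, context_len):
    """Yield (inputs, targets) pairs; targets are inputs shifted by one token."""
    for start in range(0, len(tokens) - 1, context_len):
        inputs = tokens[start:start + context_len]
        targets = tokens[start + 1:start + 1 + context_len]
        # The final window may be short; trim inputs so both sides align.
        yield inputs[:len(targets)], targets


# Made-up sample; real pipelines would stream whole articles in order.
tokens = "the battle of waterloo was fought in 1815 .".split()

for inputs, targets in next_token_windows(tokens, context_len=4):
    print(inputs, "->", targets)
```

Because windows are taken in order from unshuffled text, a model that carries state across windows can in principle use context from earlier in the same article.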

Benchmarks and Performance Results

Results on the corpus are most often reported as word-level perplexity; token-level accuracy and bits-per-character are reported less frequently. Transformer-based models from Google Research and OpenAI have achieved substantially lower perplexity than recurrent baselines from earlier work at New York University and Stanford University. Results published at venues such as NeurIPS and ICML show that large-capacity models trained on this dataset outperform smaller models on passages that require long-range coherence, such as articles about Renaissance figures or scientific topics like quantum mechanics. Comparisons often include the smaller WikiText-2 companion dataset and corpora maintained by organizations such as the Allen Institute for AI.
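
For readers comparing these metrics, the sketch below computes perplexity from per-token log-probabilities and converts it to bits per token and bits per character. It uses the standard definitions (perplexity as the exponential of the mean negative log-likelihood). The helper names and the toy numbers are assumptions for illustration, and the character-level conversion only makes sense when both counts refer to the same text.

```python
import math


def perplexity(log_probs):
    """exp of the mean negative natural-log probability per token."""
    return math.exp(-sum(log_probs) / len(log_probs))


def bits_per_token(ppl):
    """Average number of bits the model needs per token."""
    return math.log2(ppl)


def bits_per_character(ppl, n_tokens, n_chars):
    """Spread the total bits over the characters of the same text."""
    return math.log2(ppl) * n_tokens / n_chars


# Toy case: a model that assigns probability 1/4 to each of three tokens.
log_probs = [math.log(0.25)] * 3
ppl = perplexity(log_probs)

print(ppl)                             # close to 4.0
print(bits_per_token(ppl))             # close to 2.0
print(bits_per_character(ppl, 3, 12))  # close to 0.5
```

Perplexities are only comparable between models that share a tokenization, which is one reason results on this corpus are usually quoted at the word level.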

Limitations and Criticisms

Critiques of the dataset note biases inherent to Wikipedia coverage: overrepresentation of subjects from regions covered extensively in English-language media, such as the United States, the United Kingdom, and European Union states, and underrepresentation of topics from many Global South countries and indigenous communities. The reliance on Featured and Good articles privileges established figures like Winston Churchill and canonized works such as Hamlet, which can lead to skewed topical priors in models trained by teams at industrial research labs and in academia. Concerns have also been raised about temporal drift: articles reflect knowledge as of the snapshot date, so later events like the COVID-19 pandemic are absent and evolving biographies such as that of Elon Musk are out of date. Finally, ethical discussions in forums including panels at FAccT and AAAI emphasize the need to combine this corpus with diverse, contemporary sources to mitigate representational harms.
