
ProofreadPage

Generated by DeepSeek V3.2
Note: This article was automatically generated by a large language model (LLM) from purely parametric knowledge (no retrieval). It may contain inaccuracies or hallucinations. This encyclopedia is part of a research project currently under review.
Name: ProofreadPage
Developer: Wikimedia Foundation
Released: 2005
Programming language: PHP
Operating system: Cross-platform
Genre: MediaWiki extension
License: GNU General Public License

ProofreadPage is a MediaWiki extension that brings a proofreading workflow modeled on Project Gutenberg's Distributed Proofreaders to the Wikimedia ecosystem. It enables collaborative, page-by-page transcription and verification of scanned texts and forms a core technical foundation of Wikisource, the free digital library. The extension structures a multi-stage proofreading process that turns page images of books into high-quality, searchable text.

Overview

The extension runs inside the MediaWiki platform, allowing projects like Wikisource to host digital facsimiles alongside their transcribed text. Work is organized around the "Index" page, which represents one scanned work and acts as a table of contents linking to its individual page scans, typically sourced from institutions like the Internet Archive or the British Library. Each scan page has a corresponding page in the "Page" namespace that holds its transcription. The system supports a wide range of languages and scripts and has been used to digitize works ranging from the Dead Sea Scrolls to volumes from the Library of Congress. The primary goal is to produce accurate, verifiable, and freely licensed texts that can be reused across other Wikimedia projects, such as Wikipedia and Wiktionary.
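
To make the relationship between an Index page and its page scans concrete, the sketch below derives the titles of the per-page transcription pages from a scan's file name. It assumes the naming convention used on Wikisource, where a multi-page scan such as `Example.djvu` has an index at `Index:Example.djvu` and one transcription page per scan page at `Page:Example.djvu/N`. The function names and the file name are hypothetical, and the namespace names are the English ones; other language editions localize them.

```python
def index_title(file_name: str) -> str:
    """Title of the Index page for a scan file (English namespace name assumed)."""
    return f"Index:{file_name}"


def page_titles(file_name: str, page_count: int) -> list[str]:
    """Titles of the transcription pages, one per scan page, numbered from 1."""
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    return [f"Page:{file_name}/{n}" for n in range(1, page_count + 1)]


if __name__ == "__main__":
    # "Example.djvu" is a made-up file name used only for illustration.
    print(index_title("Example.djvu"))
    for title in page_titles("Example.djvu", 3):
        print(title)
```

Keeping the scan's file name inside every page title is what lets the Index page act as a table of contents: each entry can be computed from the file name and a page number alone.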

Features

Key functionalities include a dedicated proofreading interface that displays a page scan alongside an editable text field. Users assign each page one of five statuses: "Without text," "Not Proofread," "Problematic," "Proofread," or "Validated," which together define a clear workflow. Quality control rests on the validation step, in which a second contributor checks a page that another user has already proofread. The extension also includes tools for importing optical character recognition (OCR) text and organizes content into separate "Page" and "Index" namespaces. Special pages like the "Proofread Stats" provide community metrics, and the system is designed to handle complex formatting, including the transcription of musical scores from sources like the Mutopia Project or historical documents from the National Archives of India.
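
The status workflow can be pictured as a small state model. The following is a minimal sketch, not the extension's actual code: the level names and the 0 to 4 numbering follow the quality levels shown on Wikisource, while `can_set_quality` and its rule (a page may only be marked "Validated" by someone other than the user who proofread it) are a simplified assumption about how validation is gated.

```python
from enum import IntEnum
from typing import Optional


class PageQuality(IntEnum):
    """Page quality levels, numbered 0 to 4 as on Wikisource."""
    WITHOUT_TEXT = 0
    NOT_PROOFREAD = 1
    PROBLEMATIC = 2
    PROOFREAD = 3
    VALIDATED = 4


def can_set_quality(
    current: PageQuality,
    new: PageQuality,
    user: str,
    proofreader: Optional[str] = None,
) -> bool:
    """Simplified, assumed rule: only validation is restricted."""
    if new is not PageQuality.VALIDATED:
        return True
    if current is PageQuality.VALIDATED:
        # Already validated: saving again keeps the level.
        return True
    # Validation needs a proofread page and a second, different user.
    return (
        current is PageQuality.PROOFREAD
        and proofreader is not None
        and user != proofreader
    )


if __name__ == "__main__":
    # "Alice" proofread the page; "Bob" may validate it, "Alice" may not.
    print(can_set_quality(PageQuality.PROOFREAD, PageQuality.VALIDATED, "Bob", "Alice"))
    print(can_set_quality(PageQuality.PROOFREAD, PageQuality.VALIDATED, "Alice", "Alice"))
```

Modeling the levels as ordered integers makes it easy to aggregate progress for a whole work, for example by counting how many pages have reached level 3 or higher.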

Implementation

ProofreadPage is written in PHP and builds on MediaWiki's extension interfaces, including its hooks and web API. It relies on the underlying database of a MediaWiki installation to store page statuses, text revisions, and user contributions. The extension defines custom namespaces and parser hooks to render the dual-view proofreading layout. It often works in conjunction with other tools and extensions, such as those for LaTeX rendering or integration with the Toolserver and its successor, Toolforge. The extension is enabled and configured in MediaWiki's `LocalSettings.php` file, typically by loading it with `wfLoadExtension( 'ProofreadPage' );`. Its development has been influenced by collaborations with major cultural heritage digitization efforts, including partnerships with the Bibliothèque nationale de France.
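
One of the parser hooks that editors meet most often is the `<pages>` tag, which transcludes a range of transcribed pages from an index into a reading page. The sketch below only builds the wikitext for such a tag; it does not talk to a wiki. The helper name `pages_tag` and the file name are hypothetical, and the attributes shown (`index`, `from`, `to`) are the basic ones, so a real wiki may use additional options.

```python
import html


def pages_tag(index: str, first: int, last: int) -> str:
    """Build wikitext that transcludes scan pages first..last of an index."""
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    # Escape &, <, > and quotes so the attribute value stays well-formed markup.
    safe_index = html.escape(index, quote=True)
    return f'<pages index="{safe_index}" from="{first}" to="{last}" />'


if __name__ == "__main__":
    # "Example.djvu" is a made-up index name used only for illustration.
    print(pages_tag("Example.djvu", 5, 10))
```

Generating the tag from numbers rather than typing it by hand is mainly useful when a work is split into many chapters, each covering its own page range.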

Usage

The primary use case is on Wikisource, where communities proofread everything from ancient texts like the Epic of Gilgamesh to modern editions of works by William Shakespeare or Jane Austen. It is also employed in specific thematic projects, such as transcribing historical scientific papers from the Royal Society or legal documents like the Magna Carta. Users, ranging from volunteers to organized groups like WikiProject Source, coordinate efforts using talk pages and project portals. The finished, validated texts are then published as complete works, often becoming source material for citations on Wikipedia articles about events like the American Civil War or figures such as Leonardo da Vinci.

History

The extension was originally created in 2005 to replicate the functionality of the Distributed Proofreaders software for the then-new Wikisource project. Its development was championed by early Wikimedia contributors and received significant support from the Wikimedia Foundation. Over the years, it has undergone major updates to improve usability and scalability, influenced by the growing needs of large-scale digitization projects like those involving the HathiTrust Digital Library. Key milestones include its adoption as the default workflow across the Wikisource language editions and its role in enabling mass transcription projects during initiatives like the GLAM-Wiki collaborations with the New York Public Library and the Smithsonian Institution.