RetroSheet — LLMpedia

RetroSheet
Name	RetroSheet
Formation	1989
Founder	David Smith
Type	Nonprofit
Purpose	Baseball play-by-play and box score reconstruction
Headquarters	United States

Contents

History
Data Collection and Sources
Database Structure and Contents
Tools and Distribution
Usage and Impact
Access and Licensing

RetroSheet is an independent nonprofit organization dedicated to reconstructing and distributing historical Major League Baseball play-by-play and box score data. Founded in 1989 by David Smith, the project compiles game-level records from newspapers, scorebooks, and archival materials to serve researchers, historians, and statisticians, including those associated with Baseball-Reference, the Society for American Baseball Research, the National Baseball Hall of Fame and Museum, the Major League Baseball Players Association, and university archives. Its datasets underpin analyses used by authors, broadcasters, and outlets such as ESPN and Fangraphs, as well as historical projects of the Chicago Cubs and New York Yankees.

History

RetroSheet began as a volunteer effort led by founder David Smith and inspired by initiatives at the Society for American Baseball Research. Its aim was to preserve the historical record beyond the sources held by the National Baseball Hall of Fame and Museum and the Library of Congress. Early collaborators included collectors connected to the Hall of Fame's archives, retired statisticians formerly with MLB Advanced Media, and staff from newspaper sports desks at outlets such as the New York Times and Los Angeles Times. The project expanded through partnerships with regional historical societies, minor league historians associated with the International League and Pacific Coast League, and academic researchers at institutions including the University of Michigan, Stanford University, and the University of Pennsylvania. Over the decades, contributors coordinated with organizations such as the National Archives, the Smithsonian Institution, and the Public Broadcasting Service, and with independent sabermetricians linked to Bill James and Tom Tango.

Data Collection and Sources

Data sources include box scores and play-by-play accounts from newspapers such as the New York Daily News, Chicago Tribune, Boston Globe, and San Francisco Chronicle; team scorebooks donated by franchises such as the Cincinnati Reds and St. Louis Cardinals; and league records from the American League and National League offices. Volunteers digitized material from microfilm collections at the Library of Congress and the New York Public Library, as well as personal archives of contributors who procured documents from the estates of players such as Babe Ruth and Ted Williams. The project also consulted yearbooks and guides such as the Spalding Guide and the Official Baseball Guide, and collaborated with statistical repositories such as Baseball Almanac and with Society for American Baseball Research committees focused on data integrity. Questions about reusing newspaper transcriptions and scorebook images involved entities such as Major League Baseball and the Copyright Office.

Database Structure and Contents

The dataset is organized into game-level event logs, box scores, roster files, and schedule listings. Coverage runs from the early 20th century through the modern era and includes records sanctioned by the Commissioner of Baseball. Records capture plate appearances, substitutions, scoring plays, and umpire assignments, and link to player entries for athletes such as Joe DiMaggio, Jackie Robinson, Mickey Mantle, Hank Aaron, and Willie Mays. The schema supports cross-references to team-season pages for franchises such as the Boston Red Sox, Detroit Tigers, Brooklyn Dodgers, Pittsburgh Pirates, and Philadelphia Phillies. Ancillary files document game venues, including Ebbets Field, Yankee Stadium, Fenway Park, and Wrigley Field, and encode event attributes used by analysts at outlets such as Baseball Prospectus and in sabermetrics courses at Columbia University and the University of Chicago.
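The relationships described above (games holding event logs, events linking to player entries through a roster) can be sketched as a small data model. This is a minimal illustration and not RetroSheet's actual schema: every class name, field name, identifier, and event code below is a hypothetical placeholder, and treating each event as one plate appearance is a simplifying assumption.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Player:
    player_id: str  # hypothetical identifier format
    name: str


@dataclass
class PlayEvent:
    inning: int
    batting_home: bool  # True when the home team is batting
    batter_id: str      # key into the roster
    event: str          # illustrative event code, e.g. "S7"


@dataclass
class Game:
    game_id: str
    venue: str
    umpires: Dict[str, str] = field(default_factory=dict)
    events: List[PlayEvent] = field(default_factory=list)


def plate_appearances(game: Game, roster: Dict[str, Player]) -> Counter:
    """Count events per batter, resolving IDs to names via the roster.

    Simplification: every event is treated as one plate appearance.
    Batters missing from the roster are reported by their raw ID.
    """
    counts: Counter = Counter()
    for ev in game.events:
        player = roster.get(ev.batter_id)
        counts[player.name if player else ev.batter_id] += 1
    return counts


# Placeholder data only; no real game is represented here.
roster = {"exam0001": Player("exam0001", "Example Batter")}
game = Game(
    game_id="EXM190001010",
    venue="Example Park",
    umpires={"home": "Example Umpire"},
)
game.events.append(PlayEvent(1, False, "exam0001", "S7"))
game.events.append(PlayEvent(3, False, "exam0001", "K"))
game.events.append(PlayEvent(1, True, "unkn0001", "63"))

pa_counts = plate_appearances(game, roster)
```

Keeping players in a separate roster keyed by ID, rather than embedding names in each event, mirrors the cross-referencing idea in the text: an event log stays compact and a name correction only has to be made in one place.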

Tools and Distribution

Distribution channels include downloadable flat files, CSV exports used by researchers at Harvard University, and formats that can be ingested by analytic platforms such as R and by Python libraries employed by data scientists at Google and Microsoft Research. Community tools developed around the dataset include parsers, game visualizers, and web interfaces inspired by projects at Baseball-Reference and Fangraphs, as well as integration scripts used by broadcasters at Fox Sports and NBC Sports. Volunteers and developers have exchanged code on platforms such as GitHub and discussed methodology at conferences including the SABR Analytics Conference and academic meetings hosted at MIT and Stanford University.
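A community parser of the kind mentioned above might look roughly like the following sketch. It assumes a comma-separated flat file in which the first field of each line names the record type (`id`, `info`, `play`, `sub`) and an `id` record starts a new game; that layout, the sample text, and the function name `parse_event_text` are assumptions made for illustration, not a statement of the project's official format.

```python
import csv
import io

# Placeholder sample in the assumed layout; it does not describe a real game.
SAMPLE = """\
id,EXM190001010
info,visteam,AAA
info,hometeam,BBB
play,1,0,exam0001,,,S7
play,1,0,exam0002,,,K
sub,exam0003,"Example Sub",0,9,1
id,EXM190001020
info,visteam,BBB
play,1,0,exam0004,,,63
"""


def parse_event_text(text):
    """Group records by game; each 'id' record opens a new game."""
    games = []
    current = None
    # csv.reader handles quoted fields such as player names with commas.
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # ignore blank lines
        kind, fields = row[0], row[1:]
        if kind == "id" and fields:
            current = {"id": fields[0], "info": {}, "plays": [], "subs": []}
            games.append(current)
        elif current is None:
            continue  # skip records that appear before the first game
        elif kind == "info" and len(fields) >= 2:
            current["info"][fields[0]] = fields[1]
        elif kind == "play":
            current["plays"].append(fields)
        elif kind == "sub":
            current["subs"].append(fields)
        # Other record types are ignored in this sketch.
    return games


games = parse_event_text(SAMPLE)
for g in games:
    print(g["id"], len(g["plays"]), "plays,", len(g["subs"]), "subs")
```

To read a downloaded file instead of the inline sample, the same function could be called on the file's text, for example `parse_event_text(open(path, newline="").read())`, where `path` is whatever local filename the data was saved under.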

Usage and Impact

Researchers, journalists, and statisticians have used the data to study the careers of players such as Cy Young, Walter Johnson, Barry Bonds, Cal Ripken Jr., and Ichiro Suzuki; to re-evaluate historical achievements involving teams such as the Cleveland Indians, Kansas City Royals, Atlanta Braves, and San Diego Padres; and to support Hall of Fame research and biographies from publishers such as Simon & Schuster and HarperCollins. The dataset has informed sabermetric work by figures associated with Bill James, Sean Forman, and Chris Jaffe, and has been cited in analyses for major outlets including The Athletic, Sports Illustrated, and The Wall Street Journal, and in academic journals published by Oxford University Press. Its preservation efforts intersect with museum exhibits at the National Baseball Hall of Fame and Museum and with educational programs at institutions such as Yale University and Princeton University.

Access and Licensing

Data are distributed under terms negotiated with contributors, institutions, and rights holders, including newspapers and teams. Licensing arrangements parallel the practices of repositories such as Project Gutenberg for public-domain material and resemble the institutional agreements used by HathiTrust and the Internet Archive for digitized content. Access tiers accommodate academic researchers from universities such as the University of California, Berkeley, and independent developers who comply with usage policies of the kind observed by organizations such as Creative Commons and the Digital Public Library of America. The project coordinates with legal counsel familiar with United States copyright law and with the archival best practices promoted by the Society of American Archivists.

Category:Baseball statistics organizations