LLMpedia: The first transparent, open encyclopedia generated by LLMs

Heritrix

Generated by DeepSeek V3.2
Note: This article was automatically generated by a large language model (LLM) from purely parametric knowledge (no retrieval). It may contain inaccuracies or hallucinations. This encyclopedia is part of a research project currently under review.
Article Genealogy
Parent: Internet Archive (hop 4)
Expansion funnel: Raw 51 → Dedup 27 → NER 10 → Enqueued 9
1. Extracted: 51
2. After dedup: 27
3. After NER: 10
Rejected: 17 (not NE: 17)
4. Enqueued: 9
Heritrix
Name: Heritrix
Developer: Internet Archive
Released: 2003
Latest release version: 3.4.0
Latest release date: 15 December 2020
Programming language: Java
Operating system: Cross-platform
Genre: Web crawler
License: Apache License 2.0

Heritrix is an open-source, extensible web crawler designed for large-scale, archival-quality harvesting of web content, developed primarily by the Internet Archive in support of its preservation mission. The software respects robots.txt directives and other politeness policies while creating faithful copies of websites in the WARC format. As a cornerstone tool for digital preservation and web archiving, it has been widely adopted by national libraries, research institutions, and other organizations within the International Internet Preservation Consortium (IIPC).

Overview

Heritrix serves as the primary crawling engine for the Internet Archive's Wayback Machine, capturing billions of web pages for historical and research purposes. The crawler is distinguished by its focus on fidelity and completeness, aiming to produce a true archival snapshot rather than a simple index. Key operational principles include strict adherence to robots.txt rules, configurable politeness delays to avoid overloading web servers, and comprehensive logging for digital curation. Its development is closely aligned with the needs of institutions like the Library of Congress and the British Library, which participate in global preservation initiatives such as the End of Term Web Archive.
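The archival snapshots described above are stored as WARC records (standardized as ISO 28500), where each capture is a typed record with named headers followed by the payload. Schematically, a single response record looks like this (all values here are illustrative placeholders, not output from a real crawl):

```
WARC/1.0
WARC-Type: response
WARC-Target-URI: http://example.org/
WARC-Date: 2020-12-15T00:00:00Z
WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>
Content-Type: application/http; msgtype=response
Content-Length: 1234

HTTP/1.1 200 OK
Content-Type: text/html
...
```

Because the record preserves the full HTTP response, including headers, replay tools can later reconstruct the page as it was served at crawl time.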

Architecture and Design

The software is implemented in Java, providing cross-platform operation and leveraging the Apache Maven build system. Its modular architecture is built around a central CrawlController that manages a series of configurable processing chains, each handling specific tasks like URL frontier management, HTTP fetching, and content analysis. Key components include the Frontier for scheduling, Processor chains for content handling, and the Writer modules that output to the standard WARC file format. This design allows for extensive customization through Java-based modules, enabling integration with tools like Apache Tika for MIME type detection and Apache Solr for indexing.
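The chain-of-processors design can be sketched in Java as follows. This is a simplified illustration of the pattern, not Heritrix's actual API: the class and method names here are hypothetical.

```java
import java.util.List;

// Hypothetical sketch of a processing chain; not Heritrix's real classes.
class CrawlURI {
    final String uri;
    final StringBuilder log = new StringBuilder();
    CrawlURI(String uri) { this.uri = uri; }
}

interface Processor {
    void process(CrawlURI curi);
}

// Stand-ins for fetch and WARC-writing stages.
class FetchProcessor implements Processor {
    public void process(CrawlURI curi) { curi.log.append("fetched;"); }
}

class WriteProcessor implements Processor {
    public void process(CrawlURI curi) { curi.log.append("written;"); }
}

class ProcessorChain {
    private final List<Processor> processors;
    ProcessorChain(List<Processor> processors) { this.processors = processors; }

    // Hand the URI to each configured processor in order, mirroring how
    // a crawl pipeline passes a URI down its chain of stages.
    String run(CrawlURI curi) {
        for (Processor p : processors) p.process(curi);
        return curi.log.toString();
    }
}

public class ChainDemo {
    public static void main(String[] args) {
        ProcessorChain chain = new ProcessorChain(
            List.of(new FetchProcessor(), new WriteProcessor()));
        System.out.println(chain.run(new CrawlURI("http://example.org/")));
    }
}
```

The appeal of this design is that stages are interchangeable: swapping in a different writer module, or inserting an analysis step between fetch and write, changes configuration rather than core crawler code.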

Development and History

Initial development began at the Internet Archive in 2003, with the first public release supporting the nascent Wayback Machine. Major milestones include the 1.0 release in 2004 and the significant overhaul to version 3.x, which introduced a more modular architecture and improved WARC output. The project has been a collaborative effort, receiving contributions from partners within the International Internet Preservation Consortium, including the National Library of Finland and the Bibliothèque nationale de France. Development discussions and issue tracking have historically been conducted via public mailing lists and the Internet Archive's own infrastructure, fostering a community around digital preservation tools.

Usage and Deployment

Heritrix is deployed extensively by national and academic institutions for large-scale archiving projects. Notable deployments include the Library of Congress's capture of U.S. government websites, the UK Web Archive curated by the British Library, and the collective efforts of the International Internet Preservation Consortium's member archives. Typical operation involves configuring crawl jobs through Spring-based XML configuration files, specifying seed URLs, scope rules, and politeness settings. System administrators often run it on Linux servers within data center environments, pairing it with Apache Hadoop for large-scale analysis or OpenWayback for access and replay.
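A job configuration of the kind described above might contain a fragment like the following. This is an illustrative sketch only; the bean and property names are recalled from Heritrix 3's Spring-style configuration and should be checked against the profile file bundled with a given release.

```xml
<!-- Illustrative fragment of a Spring-style Heritrix job configuration.
     Class and property names may differ across releases. -->
<beans xmlns="http://www.springframework.org/schema/beans">
  <!-- Seed list: the starting URLs for the crawl, read from seeds.txt. -->
  <bean id="seeds" class="org.archive.modules.seeds.TextSeedModule">
    <property name="textSource">
      <bean class="org.archive.spring.ConfigFile">
        <property name="path" value="seeds.txt"/>
      </bean>
    </property>
  </bean>
</beans>
```

In practice operators start from the sample job profile shipped with the distribution and adjust seeds, scope rules, and politeness settings rather than writing such files from scratch.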

Within the ecosystem of web archiving tools, Heritrix is often compared and integrated with other open-source projects. The primary access and replay interface for its captures is OpenWayback, an open-source Java replay application derived from the software behind the Internet Archive's public Wayback Machine. For broader curation workflows, it is frequently used alongside Wget, HTTrack, and the Python-based Scrapy framework for more targeted crawls. Other archival crawlers include Brozzler, which uses a real Chromium browser for complex JavaScript-heavy sites, and WAIL, which provides a simplified graphical user interface. Commercial alternatives and services exist, such as Archive-It, a subscription service from the Internet Archive that itself utilizes Heritrix technology.

Category:Web archiving Category:Internet Archive Category:Free web crawlers