Nutch — LLMpedia

Nutch
Name	Nutch
Developer	Apache Software Foundation
Initial release	2002
Programming language	Java
Repository	Apache Git
License	Apache License 2.0
Website	https://nutch.apache.org

Contents

History
Architecture
Features
Deployment and Use Cases
Development and Community

Nutch is an open-source web crawler and search software project designed for large-scale web crawling and indexing. Started by Doug Cutting and Mike Cafarella, it became an Apache Software Foundation project and has been used alongside projects such as Hadoop, Lucene, Solr, Elasticsearch, and Mahout in production search stacks. Nutch integrates with distributed systems and storage technologies developed at organizations like Yahoo!, Facebook, Twitter, and LinkedIn, and with infrastructure tools from Apache Software Foundation projects such as ZooKeeper and HBase.

History

Nutch was started in 2002 by Doug Cutting and Mike Cafarella, drawing on information retrieval research from institutions such as Stanford University, Massachusetts Institute of Technology, and Carnegie Mellon University. Early development focused on scalable crawling and text indexing, using algorithms discussed at conferences such as SIGIR, the WWW Conference, and KDD. In 2005 the project entered the Apache Incubator, and it later became a top-level project; its distributed file system and MapReduce implementation were split out in 2006 to form Hadoop. Nutch has been cited in research at venues like USENIX and VLDB and integrated with technology from corporations including IBM, Yahoo!, Microsoft, and Google. Over time Nutch adopted components from other Apache projects amid ecosystem shifts driven by efforts at Cloudera, Hortonworks, and cloud providers such as Amazon Web Services, Google Cloud Platform, and Microsoft Azure.

Architecture

Nutch's architecture is modular and written primarily in Java, with pluggable components for crawling, parsing, indexing, and scoring. The crawler runs batch operations as MapReduce jobs over distributed storage such as the Hadoop Distributed File System (HDFS), while indexing is typically delegated to libraries such as Lucene or to indexing servers like Solr and Elasticsearch. Parsing relies on third-party parsers, notably Apache Tika, and includes content handlers for formats associated with vendors like Adobe Systems and Microsoft Corporation. URL filtering and politeness mechanisms follow IETF standards such as the Robots Exclusion Protocol and honor page-level control conventions such as the robots meta tag described in W3C's HTML specifications. Components are wired together through plugins and extension points, a design modeled on the Eclipse plugin architecture and comparable to modular systems such as OSGi.
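
The URL filtering step described above can be illustrated with a short Python sketch. It is not Nutch's Java implementation; it only mimics the rule convention of Nutch's regex URL filter files, where each rule is a regular expression prefixed with "+" (accept) or "-" (reject), the first matching rule decides, and a URL that matches no rule is dropped. The function names parse_rules and filter_url, the sample rules, and the example.org host are hypothetical and chosen for illustration.

```python
import re
from typing import List, Optional, Tuple

# One rule = (accept?, compiled pattern). Hypothetical in-memory form of a
# "+regex" / "-regex" rule file.
Rule = Tuple[bool, re.Pattern]

# Sample rules (assumed for illustration, not shipped defaults).
SAMPLE_RULES = r"""
# skip common static-asset suffixes
-\.(gif|jpg|png|css|js|zip)$
# skip URLs that contain query-like characters
-[?*!@=]
# accept anything on example.org and its subdomains
+^https?://([a-z0-9-]+\.)*example\.org/
"""


def parse_rules(text: str) -> List[Rule]:
    """Turn rule text into an ordered list of (accept, pattern) pairs."""
    rules: List[Rule] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError(f"rule must start with '+' or '-': {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def filter_url(url: str, rules: List[Rule]) -> Optional[str]:
    """Return the URL if the first matching rule accepts it, else None."""
    for accept, pattern in rules:
        if pattern.search(url):  # partial match, like a regex "find"
            return url if accept else None
    return None  # no rule matched: the URL is dropped


RULES = parse_rules(SAMPLE_RULES)
```

Because the rules are evaluated in order, reject rules for unwanted suffixes are usually placed before broad accept rules, as in the sample above.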

Features

Nutch provides features for large-scale web data collection and preparation. These include distributed crawling with polite scheduling that honors robots.txt policies (the Robots Exclusion Protocol, standardized by the IETF as RFC 9309), segment merging and deduplication techniques informed by research from Google Research and Yahoo! Research, and content parsing for formats handled by Apache Tika and by libraries contributed by Mozilla Foundation and Oracle Corporation. Its plugin framework enables connectors for protocols such as HTTP/1.1, HTTPS, and FTP, and for content sources like Wikipedia, Wikidata, and Internet Archive. Nutch exposes scoring and ranking hooks that can integrate learning-to-rank models from toolkits such as RankLib and XGBoost, and it can output to indexes consumable by Lucene, SolrCloud, and Elastic Cloud. Monitoring and operational features pair with systems like Prometheus, Grafana, and orchestration platforms including Kubernetes.
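
Two of the features above, robots.txt politeness and deduplication, can be sketched in a few lines of Python. This is an illustrative sketch rather than Nutch code: it uses the standard library's urllib.robotparser to evaluate a robots.txt file held in memory, and it treats two pages as duplicates when their bodies have the same SHA-256 digest. The bot name "examplebot", the sample robots.txt, the helper names, and the choice of hash function are all assumptions made for the example.

```python
import hashlib
from typing import Iterable, List, Tuple
from urllib.robotparser import RobotFileParser

# Sample robots.txt content (assumed for illustration).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def load_robots(text: str) -> RobotFileParser:
    """Parse robots.txt text without any network access."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def content_digest(body: bytes) -> str:
    """Exact-content signature: identical bytes give identical digests."""
    return hashlib.sha256(body).hexdigest()


def deduplicate(pages: Iterable[Tuple[str, bytes]]) -> List[str]:
    """Keep the first URL seen for each distinct content digest."""
    first_url_by_digest = {}
    for url, body in pages:
        first_url_by_digest.setdefault(content_digest(body), url)
    return list(first_url_by_digest.values())


robots = load_robots(ROBOTS_TXT)
allowed = robots.can_fetch("examplebot", "https://example.org/public/page.html")
blocked = robots.can_fetch("examplebot", "https://example.org/private/data.html")
delay_seconds = robots.crawl_delay("examplebot")  # seconds between requests

kept = deduplicate([
    ("https://example.org/a", b"same body"),
    ("https://example.org/a?ref=1", b"same body"),
    ("https://example.org/b", b"different body"),
])
```

A digest over the raw bytes only catches exact duplicates; near-duplicate detection needs a signature computed over normalized text, which this sketch does not attempt.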

Deployment and Use Cases

Nutch has been deployed for enterprise search at companies including Yahoo!, for research crawls at organizations like Common Crawl and Internet Archive, and in bespoke search solutions for organizations such as the BBC, the Wikimedia Foundation, and universities including the University of Cambridge and the University of Oxford. Use cases span web archiving with partners like the Library of Congress, vertical search implementations akin to services from Bing and DuckDuckGo, and data acquisition for analytics projects at firms such as Twitter and LinkedIn. Deployments commonly run on clusters managed with distributions from Cloudera or with managed services such as Amazon EMR and Google Kubernetes Engine, and they integrate security policies influenced by OWASP guidance and identity systems such as LDAP and Kerberos.

Development and Community

Nutch development is coordinated through the Apache Software Foundation governance model, with contributors from corporations, research labs, and independent developers associated with entities like Yahoo! Research, Cloudera, Hortonworks, Apache Lucene PMC, and academic groups at MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). The project communicates via mailing lists, issue trackers on Apache JIRA, and code repositories hosted with Apache Git; contributions follow the Apache Contributor License Agreement and review processes similar to those used in projects like Apache Hadoop and Apache Lucene. Community efforts include integration sprints, participation at conferences such as ApacheCon, and collaborations with adjacent projects like Apache Solr and Apache Tika to advance interoperability and maintainability.

Category:Apache Software Foundation projects