| Aho–Corasick algorithm | |
|---|---|
| Name | Aho–Corasick algorithm |
| Class | String-searching algorithm |
| Data | Trie |
| Time | O(n + m + z) |
| Space | O(m) |
Aho–Corasick algorithm. The Aho–Corasick algorithm is a string-searching algorithm for locating all occurrences of any of a finite set of strings within an input text. It was developed by Alfred V. Aho and Margaret J. Corasick at Bell Labs in 1975, and it generalizes the single-pattern Knuth–Morris–Pratt approach to multiple patterns. The algorithm preprocesses the set of patterns into a finite-state machine augmented with failure links, allowing it to scan the text in a single pass.
The algorithm constructs a trie from the given set of pattern strings, where each node represents a prefix of one or more patterns. It then augments this trie with three functions: a goto function for transitioning on matching characters, a failure function for handling mismatches, analogous to the prefix function of the KMP algorithm, and an output function for reporting matches. The result behaves as a deterministic finite automaton that processes the input text in time linear in its length plus the total number of matches found. The efficiency and elegance of the approach made it a foundational technique in computer science, particularly within computational linguistics.
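A node of the augmented trie can be sketched as follows. This is an illustrative layout only; the field names `goto`, `fail`, and `output` mirror the three functions described above but are not a prescribed API:

```python
class Node:
    """One state of an Aho-Corasick automaton (illustrative sketch)."""
    __slots__ = ("goto", "fail", "output")

    def __init__(self):
        self.goto = {}     # goto function: character -> child Node
        self.fail = None   # failure link: state for the longest proper suffix
        self.output = []   # output function: patterns ending at this state
```

Using a dictionary for the goto function keeps each node sparse regardless of alphabet size.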
The construction proceeds in two main phases. First, the goto function is built by inserting all pattern strings into a trie, with terminal nodes marking the end of a pattern. Second, the failure function is computed by a breadth-first traversal of the trie, linking each node to the node for the longest proper suffix of its prefix that is also a prefix of some pattern; this mirrors the prefix function of the Knuth–Morris–Pratt algorithm. During the search phase, the algorithm consumes the input text character by character, following goto links on matches and failure links on mismatches, so previously matched characters are never re-examined. Whenever a node with a non-empty output function is reached, all patterns ending at that position are reported.
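The two construction phases can be sketched in Python as follows. This is a minimal illustration under my own naming (`build_automaton`, `goto`, `fail`, `output`), not a standard API:

```python
from collections import deque

class Node:
    def __init__(self):
        self.goto = {}     # character -> child node
        self.fail = None   # failure link
        self.output = []   # patterns ending at this node

def build_automaton(patterns):
    # Phase 1: goto function -- insert every pattern into a trie.
    root = Node()
    for pat in patterns:
        node = root
        for ch in pat:
            node = node.goto.setdefault(ch, Node())
        node.output.append(pat)
    # Phase 2: failure function -- breadth-first traversal of the trie.
    queue = deque()
    for child in root.goto.values():
        child.fail = root          # depth-1 nodes fall back to the root
        queue.append(child)
    while queue:
        cur = queue.popleft()
        for ch, child in cur.goto.items():
            # Walk failure links until a state with a goto on ch (or root).
            f = cur.fail
            while f is not None and ch not in f.goto:
                f = f.fail
            child.fail = f.goto[ch] if f is not None else root
            # Inherit matches ending at the failure target.
            child.output += child.fail.output
            queue.append(child)
    return root
```

Because the traversal is breadth-first, a node's failure link always points to a shallower node whose own links are already final when the node is processed.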
The time complexity for building the automaton is O(m), where m is the combined length of all pattern strings. The search phase runs in O(n + z) time, where n is the length of the input text and z is the total number of reported matches, giving O(n + m + z) overall. This linear performance is a significant advantage over naive multi-pattern search, which can require time proportional to n times the number of patterns. The space complexity for storing the automaton is O(m), proportional to the number of nodes in the trie. The algorithm is a standard topic in algorithms textbooks and courses.
Consider searching for the patterns {"he", "she", "his", "hers"} within the text "ushers". The algorithm first builds a trie containing these strings. The failure function links, for instance, the node for "sh" to the node for "h", since "h" is the longest proper suffix of "sh" that is also a prefix of some pattern. Processing the text from the root, the automaton matches "she" at positions 2–4, "he" at positions 3–4, and "hers" at positions 3–6, all in a single pass. This simultaneous matching demonstrates the power of the automaton approach, a concept also central to tools such as GNU grep and the Snort intrusion detection system.
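The example can be reproduced with a compact, self-contained sketch (identifiers are illustrative). It reports each match as a 0-based start index paired with the pattern:

```python
from collections import deque

class Node:
    def __init__(self):
        self.goto, self.fail, self.output = {}, None, []

def build(patterns):
    root = Node()
    for pat in patterns:                      # phase 1: trie of patterns
        node = root
        for ch in pat:
            node = node.goto.setdefault(ch, Node())
        node.output.append(pat)
    queue = deque()
    for child in root.goto.values():          # phase 2: BFS failure links
        child.fail = root
        queue.append(child)
    while queue:
        cur = queue.popleft()
        for ch, child in cur.goto.items():
            f = cur.fail
            while f is not None and ch not in f.goto:
                f = f.fail
            child.fail = f.goto[ch] if f is not None else root
            child.output += child.fail.output
            queue.append(child)
    return root

def search(root, text):
    matches, node = [], root
    for i, ch in enumerate(text):
        while node is not root and ch not in node.goto:
            node = node.fail                  # mismatch: follow failure link
        node = node.goto.get(ch, root)        # match: follow goto link
        for pat in node.output:
            matches.append((i - len(pat) + 1, pat))
    return matches

matches = search(build(["he", "she", "his", "hers"]), "ushers")
# 0-based starts: [(1, 'she'), (2, 'he'), (2, 'hers')]
```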
The algorithm is widely used in practical systems requiring high-speed multi-pattern matching. It forms the core of intrusion detection and antivirus tools such as Snort and ClamAV, which scan for large sets of signatures. In computational biology, it is employed in DNA sequence analysis for searching multiple motifs simultaneously. GNU grep utilizes a variant for fixed-string searching. It is also fundamental to natural language processing tasks such as keyword spotting and text mining, and it is implemented in various programming-language libraries, including components of the Apache Lucene search engine.
Efficient implementations often represent the goto function as a sparse array or hash table to conserve memory, especially for large alphabets such as Unicode. The failure links can be precomputed and stored within each node structure. For very large pattern sets, techniques from the Commentz-Walter algorithm or bit-parallelism can offer complementary optimizations. The algorithm has been adapted into concurrent versions for multi-core systems and is a staple of competitive programming contests such as the International Collegiate Programming Contest. Its integration into systems like the Linux kernel for string filtering underscores its enduring practical relevance.
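For a small fixed alphabet, the failure links can be folded away entirely by precomputing a dense next-state table, trading memory for a single table lookup per input character. The following sketch illustrates the idea; the names (`build_table`, `delta`, `scan`) and the array-of-states layout are my own, not a standard interface:

```python
from collections import deque

def build_table(patterns, alphabet):
    # Trie as parallel arrays; state 0 is the root.
    goto, fail, out = [{}], [0], [[]]
    for pat in patterns:
        s = 0
        for ch in pat:
            if ch not in goto[s]:
                goto.append({}); fail.append(0); out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(pat)
    # Fold failure links into a dense table delta[state][char]; BFS order
    # guarantees delta[fail[s]] is ready before state s needs it.
    delta = [None] * len(goto)
    delta[0] = {c: goto[0].get(c, 0) for c in alphabet}
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        out[s] += out[fail[s]]                  # inherit suffix matches
        delta[s] = {c: goto[s].get(c, delta[fail[s]][c]) for c in alphabet}
        for ch, t in goto[s].items():
            fail[t] = delta[fail[s]][ch]
            queue.append(t)
    return delta, out

def scan(delta, out, text):
    s, hits = 0, []
    for i, ch in enumerate(text):
        s = delta[s][ch]                        # one lookup per character
        hits += [(i - len(p) + 1, p) for p in out[s]]
    return hits
```

The dense table uses space proportional to the number of states times the alphabet size, which is why the sparse hash-table representation remains preferable for alphabets like Unicode.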
Category:String-searching algorithms