Knuth–Morris–Pratt algorithm

Class: String-searching algorithm
Data structure: String
Time: O(n + k)
Space: O(k)

Contents

Overview
Algorithm description
Prefix function
Complexity analysis
Example
Applications

The Knuth–Morris–Pratt algorithm (KMP) is a string-searching algorithm that finds occurrences of a pattern within a text. It was developed by Donald Knuth, James H. Morris, and Vaughan Pratt, who published their joint paper in 1977. Its key idea is to preprocess the pattern into a partial match table, which lets the search skip comparisons whose outcome is already known and so achieve linear time complexity in the worst case.

Overview

The algorithm improves on the naive method by avoiding redundant character comparisons after a mismatch. It does so by precomputing a function of the pattern alone, often called the failure function or prefix function. Pattern preprocessing of this kind is also central to the Boyer–Moore string-search algorithm, which was published in the same year. Knuth, one of the co-authors, is also known for his series The Art of Computer Programming. The approach is closely related to automata-based matching, such as the Aho–Corasick algorithm, which searches for multiple patterns at once.

Algorithm description

The procedure operates in two distinct phases: preprocessing and searching. During preprocessing, the algorithm constructs an array for the given pattern, often denoted `lps` for "longest proper prefix which is also a suffix." The search phase then scans the text from left to right, using the precomputed array to determine how far the pattern can be shifted after a mismatch. Because the position in the text never moves backward, the search avoids the backtracking that the naive string-search algorithm performs. The authors published the method in the SIAM Journal on Computing, and it was later incorporated into standard texts like Introduction to Algorithms.
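
The two phases can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the names `build_lps` and `kmp_search` are chosen here for convenience, and returning an empty list for an empty pattern is an assumption of this sketch.

```python
def build_lps(pattern: str) -> list[int]:
    """Preprocessing: lps[i] is the length of the longest proper prefix
    of pattern[:i + 1] that is also a suffix of it."""
    lps = [0] * len(pattern)
    length = 0  # length of the prefix currently matched
    for i in range(1, len(pattern)):
        while length > 0 and pattern[i] != pattern[length]:
            length = lps[length - 1]  # fall back to a shorter candidate
        if pattern[i] == pattern[length]:
            length += 1
        lps[i] = length
    return lps


def kmp_search(text: str, pattern: str) -> list[int]:
    """Searching: return the start index of every occurrence of pattern."""
    if not pattern:
        return []  # assumption of this sketch: empty pattern, no matches
    lps = build_lps(pattern)
    matches = []
    j = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):  # the text index i only moves forward
        while j > 0 and ch != pattern[j]:
            j = lps[j - 1]  # shift the pattern, keep i where it is
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):
            matches.append(i - j + 1)
            j = lps[j - 1]  # keep going to find overlapping matches
    return matches


print(kmp_search("abababa", "aba"))  # [0, 2, 4]
```

Overlapping occurrences are reported because, after a full match, the search falls back through `lps` instead of restarting at zero.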

Prefix function

The core of the preprocessing is the computation of the prefix function. For each position in the pattern, it records the length of the longest proper prefix of the substring ending at that position that is also a suffix of that substring, where a proper prefix is any prefix other than the whole substring. The function is computed in time linear in the pattern length by an iterative method that reuses the values already computed for earlier positions. Its mathematical properties are closely connected to combinatorics on words and formal language theory. Computing it is analogous to building a deterministic finite automaton that recognizes the pattern, a construction sometimes called the KMP automaton.
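
One common way to write the iterative computation is sketched below. The function name `prefix_function` and the sample pattern "aabaaab" are illustrative choices, not taken from the original paper.

```python
def prefix_function(pattern: str) -> list[int]:
    """Return pi, where pi[i] is the length of the longest proper prefix
    of pattern[:i + 1] that is also a suffix of pattern[:i + 1]."""
    pi = [0] * len(pattern)
    for i in range(1, len(pattern)):
        k = pi[i - 1]  # best border length for the previous position
        # If the border cannot be extended by pattern[i], try the next
        # shorter border, which is itself given by the prefix function.
        while k > 0 and pattern[i] != pattern[k]:
            k = pi[k - 1]
        if pattern[i] == pattern[k]:
            k += 1  # the border grows by one character
        pi[i] = k
    return pi


print(prefix_function("aabaaab"))  # [0, 1, 0, 1, 2, 2, 3]
```

The inner `while` loop can run several times for one position, but `k` increases by at most one per position and every fallback decreases it, so the total work stays linear in the pattern length.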

Complexity analysis

The algorithm's time complexity is O(n + k), where n is the length of the text and k is the length of the pattern: preprocessing takes O(k) time and searching takes O(n) time. The search bound holds because every character comparison either advances the position in the text or shifts the pattern forward, and each of these can happen at most n times, so the search performs at most 2n comparisons. This is a worst-case linear-time guarantee, a substantial improvement over the quadratic O(nk) worst case of simpler methods. The space complexity is O(k), required to store the prefix function array. The bound is proved in the original paper, published in the SIAM Journal on Computing. This efficiency makes the algorithm suitable for searching large datasets, such as those in computational biology or network intrusion detection systems.
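
The difference between the linear and quadratic bounds can be observed by counting character comparisons. The sketch below is an illustration only: the helper names `kmp_comparisons` and `naive_comparisons` and the adversarial input (a run of "A" searched for "AAAAAAAAAB") are assumptions made for this example, and the pattern is assumed to be non-empty.

```python
def kmp_comparisons(text: str, pattern: str) -> int:
    """Count text-versus-pattern comparisons made by the KMP search phase.
    Assumes a non-empty pattern."""
    # Preprocessing: build the lps array (its comparisons are not counted).
    lps = [0] * len(pattern)
    length = 0
    for i in range(1, len(pattern)):
        while length > 0 and pattern[i] != pattern[length]:
            length = lps[length - 1]
        if pattern[i] == pattern[length]:
            length += 1
        lps[i] = length

    comparisons = 0
    i = j = 0
    while i < len(text):
        comparisons += 1
        if text[i] == pattern[j]:
            i += 1  # the text position advances
            j += 1
            if j == len(pattern):
                j = lps[j - 1]
        elif j > 0:
            j = lps[j - 1]  # the pattern shifts forward, i is unchanged
        else:
            i += 1  # the text position advances
    return comparisons


def naive_comparisons(text: str, pattern: str) -> int:
    """Count comparisons made by trying every alignment from scratch."""
    count = 0
    for start in range(len(text) - len(pattern) + 1):
        for j in range(len(pattern)):
            count += 1
            if text[start + j] != pattern[j]:
                break
    return count


text = "A" * 1000
pattern = "A" * 9 + "B"
print(kmp_comparisons(text, pattern) <= 2 * len(text))  # True
print(naive_comparisons(text, pattern) > kmp_comparisons(text, pattern))  # True
```

On this input the naive method re-reads nine matching characters at every alignment, while the KMP count stays within the 2n bound.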

Example

Consider searching for the pattern "ABCDABD" within a text. The preprocessing step constructs the `lps` array as [0,0,0,0,1,2,0]. During the search, suppose a mismatch occurs after "ABCDAB" has been matched. The `lps` value for those six characters is 2, so the pattern is shifted by 6 − 2 = 4 positions, which aligns its leading "AB" with the "AB" just matched in the text, and comparison resumes at the third pattern character. This example illustrates how the algorithm avoids re-checking characters already known to match. Other algorithms, such as the Zhu–Takaoka string matching algorithm, likewise rely on precomputed tables to skip comparisons. Such examples are commonly used in courses at institutions like MIT and Stanford University.
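
The shift arithmetic in this example can be written out directly. The helper `shift_after_mismatch` below is a hypothetical name used only for illustration; it takes the `lps` array quoted above and the number of characters matched before the mismatch, which is assumed to be at least one.

```python
def shift_after_mismatch(lps: list[int], matched: int) -> tuple[int, int]:
    """Return (shift, resume_index) after a mismatch that follows
    `matched` successfully compared pattern characters (matched >= 1)."""
    resume_index = lps[matched - 1]  # characters that are still known to match
    shift = matched - resume_index   # how far the pattern slides forward
    return shift, resume_index


lps = [0, 0, 0, 0, 1, 2, 0]  # lps array for "ABCDABD", as given in the text

# Mismatch after matching "ABCDAB" (6 characters):
print(shift_after_mismatch(lps, 6))  # (4, 2)

# Mismatch after matching "ABCD" (4 characters): nothing can be reused.
print(shift_after_mismatch(lps, 4))  # (4, 0)
```

In the first case the two characters "AB" are kept, so the next comparison uses pattern index 2 against the same text character that caused the mismatch.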

Applications

The Knuth–Morris–Pratt algorithm is foundational in stringology and is implemented in many standard libraries for programming languages. It is particularly useful in text editors for find operations, in digital forensics for data carving, and in search engines for substring queries. Variants of its logic are employed in bioinformatics tools for DNA sequence alignment and in network security software for signature-based detection. Its principles also underpin more complex algorithms used in the Unix utility grep and within the Linux kernel.
