27.6. KMP String Search Algorithm¶
27.6.1. KMP String Search Algorithm¶
This apparently more efficient string search algorithm was discovered in the 1970s by D. E. Knuth, J. H. Morris, and V. R. Pratt. Consequently, it is known as the Knuth‑Morris‑Pratt (or KMP) algorithm. The key to its search efficiency is the following. When a mismatch occurs in a particular alignment at index \(p\) of \(sub\), then we must look to the character matches that occurred in the portion of \(sub\) preceding index \(p\). We are seeking a substring of sub in the portion of \(sub\) immediately prior to index \(p\) that matches a leading substring of \(sub\). Once found, \(sub\) may be realigned so that this leading substring overlays what had been the matching substring immediately prior to index \(p\). The character‑by‑character comparison can then proceed from the position of the prior mismatch.
Its use requires an initial pass through the string sub to determine the appropriate amount of realignment when a mismatch occurs at position \(p\) in \(sub\). Note that this determination is dependent only on \(sub\), not at all on \(master\). In effect, for each index \(p\), we seek the longest sequence of characters immediately preceding position \(p\) that matches a sequence at the beginning of \(sub\). We must qualify this slightly to avoid problems in the degenerate case, in which all characters preceding position \(p\) are the same. When this occurs, we restart the matching pass through sub at position \(p ‑ 1\). In other words, we specifically seek the maximum sequence of characters immediately preceding index \(p\) with length less than \(p\) such that this sequence matches a sequence at the beginning of \(sub\). We will store, for each index \(p\), the length of such a sequence in an array called \(align\). Given this definition of the \(align\) array, the following slideshow indicates how the KMP algorithm would work with a particular \(master\) and \(sub\) string.
The preceding slideshow has unveiled the following pseudocode for the KMP algorithm::
input: master string, sub string, align array m = 0 s = 0 while((s < sub.length) and (sub.length - s <= master.length - m)): if(master[m] == sub[s]): m++, s++ else if(s == 0): m++ else: s = align[s] if(s == sub.length): return m - sub.length else: return -1
See if you can predict how one step in this KMP algorithm would progress by trying the following exercise.
The creation of the \(align\) is itself an interesting algorithm and requires some explanation. Since \(align[p]\) must be less than \(p\), we start by initializing \(align\) to ‑1 and \(align\) to 0. Since the align array is computed for successive values of \(p\), \(align[p ‑ 1]\) will have been computed by the time we attempt to compute \(align[p]\), allowing us to iterate through the computation of the \(align\) array as indicated in the following slideshow.
The preceding slideshow has illustrated the following pseudocode for the computation of the \(align\) array in the KMP algorithm::
align = -1 align = 0 L = string.length for(p = 2; p < L; p++): q = align[p-1] while((q>= 0) and (string[q] != string[p-1])): q = align[q] align[p] = q+1
See if you can predict how one step in this \(align\) algorithm would progress by trying the following exercise.
To indicate that you have fully mastered the intricacies of the KMP algorithm, you must now succeed in working your way through the following three exercises:
- Exercise in counting shifts and compares needed by KMP algorithm
- Exercise in determining strings with specified number of shifts and compares