Knuth-Morris-Pratt Algorithm

Tutorials

07 Mar


Given some pattern p and some string s, how can we efficiently find out whether or not the pattern p exists within the string s as a substring of s?

In this tutorial, we discuss the workings of the Knuth-Morris-Pratt (KMP) string searching algorithm.


The General Idea Let's say we have the following string, s, and pattern, p:
  • s = "abra abracad abracadabra"
  • p = "abracadabra"

If this were the naive approach, we would need quite a lot of comparisons – 154 to be exact! Before we start, it would be best to note that we'll be using zero-based arrays for this. So since the string s is really an array of characters, then s[0] would be referencing to the character 'a', and p[2] would be the character 'r'.

We will also need two more variables: i and j. Here, i will be the index into the pattern, p; and j will be the index into the string, s. Let's start off with the first match-up:

  • j = 012345678901234567890123
  • s = abra abracad abracadabra
  • p = abracadabra
  • i = 01234567890

The values of the index for j and i are 0-33 and 0-10, respectively.
Here we can see that p[0] and s[0] are equal (a=a), but then we reach s[4] and p[4]. Here we have a space character and the character 'c'. Clearly they don’t match up. So what we need to do now is slide the pattern along. This is the part that isn't like the naive approach. Instead of sliding the pattern across by one, we slide the pattern across by a special value. You see, behind the scenes there is in fact a table, T. We'll get to this table later, but for now all you need to know is that we slide the pattern across to j=3 and reset i=0.

We get the value of j=3 from: j = j+i-T[i] = 0+4–1 = 3. In this equation, j=0 because 0 is the starting point for this match-up (next time it will be 3). i=4, because this is the point of failure – the position in p where we found our mismatch. And T[i]=1 because, well we'll get to this later, for now just trust me.

With these new values, we can now slide the pattern along:

  • j = 012345678901234567890123
  • s = abra abracad abracadabra
  • p =    abracadabra
  • i =    01234567890

Now we see that we find a mismatch at s[4] and p[1]. Therefore, we slide the pattern along to j=4. If you wish to know why, it’s because j=3, i=1 and T[i]=0, sticking them into that equation gives 3+1-0 = 4.

For now, we shall skip over the next match-up and jump to the one after:

  • j = 012345678901234567890123
  • s = abra abracad abracadabra
  • p =      abracadabra
  • i =      01234567890

This should be better to show off why this algorithm is so good. Here we compare the characters again, and fail at s[12] and p[7]. But because of the way the algorithm works, we slide the pattern across by a huge amount (again, skipping the simple step of failing on the first character; but it should be slid along to j=12):

  • j = 012345678901234567890123
  • s = abra abracad abracadabra
  • p =              abracadabra
  • i =              01234567890

As you can see the next step succeeds, and we find a match at j=13! What's more amazing is that we have only made 28 comparisons; a lot better than the naive approaches 154.


So, what is this T table? The KMP algorithm has an initial pre-processing stage. This is a stage that is computed before the algorithm starts the main string searching. During this pre-processing stage, a table is generated from the given pattern. Let's take a look at what the table looks like for the above pattern:

i 0 1 2 3 4 5 6 7 8 9 10
p[i] a b r a c a d a b r a
T[i] -1 0 0 0 1 0 1 0 1 2 3

Now this may look rather complex, but it really isn’t that hard to understand. First things first, the values of T[0] and T[1] are always fixed, they never change. Therefore, T[0]=-1 and T[1]=0 all of the time. This means that the main program begins from i=2. There is also another variable that we need to use, and we shall call this variable sub, with an initial value of sub=0.

And here is some pseudo-code to populate the T table:

//initialise T[0] and T[1] T[0] = -1 T[1] = 0 //initialise i and sub i = 2 sub = 0 //whilst i does not exceed the length of p WHILE (i < length_of_p) DO: //we have a continued substring in p from the start of p IF (p[i-1] == p[sub]) THEN: sub = sub + 1 T[i] = sub i = i + 1 //if the substring stops, fall back to the start ELSE IF (sub > 0) THEN: sub = T[sub] //we weren't in a substring, so use default value ELSE T[i] = 0 i = i + 1 ENDIF ENDWHILE

I am not going to walk through the whole process. So let's start at i=7 and sub=0.

Starting at the IF-statement, we test to see if p[6] is equal to p[0]. Since 'd' is not equal to 'a' then we move onto the following ELSEIF-statement. Here sub is not greater than 0, so we move onto the final ELSE-statement. With this we make T[7]=0 and increment i to i=8.

Now, starting back at the IF-statement, we test to see if p[7] equals p[0], and by surprise it does since 'a' = 'a'. This means that we increment sub to sub=1, and we make T[8]=1 and increment i again to i=9.

Moving on, we test p[8] to p[1] this time, since sub=1. This time we succeed again since 'b' = 'b'. Therefore we increment sub and i to sub=2 and i=10. And obviously we set T[9]=2.

I'll leave T[10] up to you, as well as the ones I missed, but it should be obvious by now how this works. If you're pondering why we need a –1 at the beginning, it's so that if we fail on the very first character, we simply move along to the next character in the main string. As if this were just 0, we would not slide the pattern along.


The Main Algorithm Seeing as we now understand how the T table is created, the main algorithm for the KMP is very simple.

First of all, we have to remember that we have a pattern to find, p, and a string to search, s. Then of course we have i which is the current index into p, and j which is the current index into s.

Here's the pseudo-code for the KMP:

//while we don't exceed the length of string s WHILE (j + i < length_of_s) DO: //character matching for parallel characters IF (p[i] == s[j+i]) THEN: //reached the end of the pattern, match found IF (i == length_of_p – 1) THEN: RETURN j ENDIF i=i+1 //parallel characters don't match, slide pattern along ELSE j = j + i - T[i] IF (T[i] > –1) i = T[i] ELSE i = 0 ENDIF ENDIF ENDWHILE //we reached the end of the string, match not found RETURN length_of_s

This may look horribly daunting at first, but it really is very simple. The first IF-statement is simply the character-by-character matching process; matching parallel characters from the pattern and the string. If we reach the end of the pattern, return the position in s where the pattern was found. Otherwise, we still have a character-by-character match, so increment to the next character in the pattern for testing.

The real fun beginnings inside of the ELSE-statement. The equation here (j = j + i - T[i]) should look very familiar. Yep, we came across it earlier. This line simply sets the new start position in string s to where we start matching the pattern against it again, and of course, you now know what the T table looks like now.

The next part, and the final IF-statement, is a little more tricky to grasp. Basically, when we slide the pattern along, we don't always need to re-test some of the characters from the pattern. This is because of the way the T table works, and it's called backtracking. For example, if we have the following string and pattern:

  • j = 01234567890
  • s = abcdab abcd
  • p = abcdabd
  • i = 0123456

We can see that the initial start point will fail at s[6] and p[6]. Because of where this failed, T[6] is actually equal to the value 2 (check this yourself). Therefore, we slide the pattern along to j = j + i + T[i] = 0 + 6 – 2 = 4:

  • j = 01234567890
  • s = abcdab abcd
  • p =     abcdabd
  • i =     0123456

But wait a second, we don’t need to compare the first 2 characters of p again, because we know that they already match up due to the T table. Therefore, we start at index i = T[6] = 2. This saves us a lot of time from doing none needed comparisons!

This backtracking only works if the value at T[i] is greater than -1. Otherwise, we just start at index i=0 in the pattern all of the time.


Complexity The best part is that both the pre-processing phase and the main phase run in linear time. Therefore, the construction of the T table takes O(m) time, where m is the length of the pattern, p. The main program itself takes O(n) time where n is the length of the string, s. Therefore the overall time complexity is O(m+n).



Share this:

Comments


  Allowed tags: <a> <b> <strong> <i> <em> <del> [code] [com]