
Finding all occurrences of a pattern in a text is a problem that arises frequently in text-editing programs. Typically, the text is a document being edited, and the pattern searched for is a particular word supplied by the user. Efficient algorithms for this problem can greatly aid the responsiveness of the text-editing program. String-matching algorithms are also used, for example, to search for particular patterns in DNA sequences.

We formalize the **string-matching problem** as follows. We assume that the text is an array *T*[1 . . *n*] of length *n* and that the pattern is an array *P*[1 . . *m*] of length *m* ≤ *n*. We further assume that the elements of *P* and *T* are characters drawn from a finite alphabet Σ.

We say that pattern *P* **occurs with shift** *s* in text *T* (or, equivalently, that *P* occurs beginning at position *s* + 1 in *T*) if 0 ≤ *s* ≤ *n* - *m* and *T*[*s* + 1 . . *s* + *m*] = *P*[1 . . *m*]. If *P* occurs with shift *s* in *T*, then we call *s* a **valid shift**; otherwise, *s* is an **invalid shift**. The string-matching problem is the problem of finding all valid shifts with which a given pattern *P* occurs in a given text *T*.

This chapter is organized
as follows. In Section 1 we review the naive brute-force algorithm for the
string-matching problem, which has worst-case running time *O*((*n* -
*m* + 1)*m*). Section 2 presents an interesting string-matching
algorithm, due to Rabin and Karp. This algorithm also has worst-case running
time *O*((*n* - *m* + 1)*m*), but it works much better on
average and in practice. It also generalizes nicely to other pattern-matching
problems. Section 3 then describes a string-matching algorithm that begins by
constructing a finite automaton specifically designed to search for occurrences
of the given pattern *P* in a text. This algorithm runs in time *O*(*n* + *m*|Σ|). The
similar but much cleverer Knuth-Morris-Pratt (or KMP) algorithm is presented in
Section 4; the KMP algorithm runs in time *O*(*n* + *m*).
Finally, Section 5 describes an algorithm due to Boyer and Moore that is often
the best practical choice, although its worst-case running time (like that of
the Rabin-Karp algorithm) is no better than that of the naive string-matching
algorithm.

Notation and terminology

We shall let Σ* (read 'sigma-star') denote the set of all finite-length strings formed using characters from the alphabet Σ. In this chapter, we consider only finite-length strings. The zero-length **empty string**, denoted ε, also belongs to Σ*. The length of a string *x* is denoted |*x*|. The **concatenation** of two strings *x* and *y*, denoted *xy*, has length |*x*| + |*y*| and consists of the characters from *x* followed by the characters from *y*.

We say that a string *w* is a **prefix** of a string *x*, denoted *w* ⊏ *x*, if *x* = *wy* for some string *y* ∈ Σ*. Similarly, we say that a string *w* is a **suffix** of a string *x*, denoted *w* ⊐ *x*, if *x* = *yw* for some *y* ∈ Σ*. Note that in either case |*w*| ≤ |*x*|, and that the empty string ε is both a suffix and a prefix of every string.

Lemma 1

Suppose that *x*, *y*, and *z* are strings such that *x* ⊐ *z* and *y* ⊐ *z*. If |*x*| ≤ |*y*|, then *x* ⊐ *y*. If |*x*| ≥ |*y*|, then *y* ⊐ *x*. If |*x*| = |*y*|, then *x* = *y*.

**Proof** See Figure 2 for a graphical proof.

For brevity of notation, we shall denote the *k*-character prefix *P*[1 . . *k*] of the pattern *P*[1 . . *m*] by *P _{k}*. Thus, *P*_{0} = ε and *P _{m}* = *P* = *P*[1 . . *m*]. Similarly, we denote the *k*-character prefix of the text *T* by *T _{k}*. Using this notation, we can state the string-matching problem as that of finding all shifts *s* in the range 0 ≤ *s* ≤ *n* - *m* such that *P* ⊐ *T*_{*s*+*m*}.

In our pseudocode, we
allow two equal-length strings to be compared for equality as a primitive
operation. If the strings are compared from left to right and the comparison
stops when a mismatch is discovered, we assume that the time taken by such a
test is a linear function of the number of matching characters discovered. To
be precise, the test '*x* = *y*' is assumed to take time Θ(*t* + 1), where *t* is the length of the longest string *z* such that *z* ⊏ *x* and *z* ⊏ *y*. (We write Θ(*t* + 1) rather than Θ(*t*) to handle the case in which *t* = 0; the first characters compared do not match, but it takes a positive amount of time to perform this comparison.)

The naive algorithm finds all valid shifts using a
loop that checks the condition *P*[1 . . *m*] = *T*[*s + 1 . .
s + m*] for each of the *n* - *m* + 1 possible values of *s*.

NAIVE-STRING-MATCHER(*T*, *P*)

1  *n* ← *length*[*T*]
2  *m* ← *length*[*P*]
3  **for** *s* ← 0 **to** *n* - *m*
4       **do if** *P*[1 . . *m*] = *T*[*s* + 1 . . *s* + *m*]
5             **then** print 'Pattern occurs with shift' *s*

The naive string-matching
procedure can be interpreted graphically as sliding a 'template'
containing the pattern over the text, noting for which shifts all of the
characters on the template equal the corresponding characters in the text, as
illustrated in Figure 3. The **for** loop beginning on line 3 considers each
possible shift explicitly. The test on line 4 determines whether the current
shift is valid or not; this test involves an implicit loop to check
corresponding character positions until all positions match successfully or a
mismatch is found. Line 5 prints out each valid shift *s*.
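As a concrete illustration, the sliding-template procedure can be sketched in Python (the function name and the slice-based comparison are ours, not the chapter's):

```python
def naive_string_matcher(text: str, pattern: str) -> list:
    """Return every valid shift s (0-based) at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):          # the n - m + 1 possible shifts
        if text[s:s + m] == pattern:    # implicit left-to-right comparison
            shifts.append(s)
    return shifts

print(naive_string_matcher("acaabc", "aab"))  # -> [2]
```

The slice comparison `text[s:s + m] == pattern` plays the role of the implicit loop on line 4: it scans left to right and stops at the first mismatch.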

Procedure NAIVE-STRING-MATCHER takes time Θ((*n* - *m* + 1)*m*) in the worst case. For example, consider the text string a*^{n}* (a string of *n* a's) and the pattern a*^{m}*. For each of the *n* - *m* + 1 possible values of the shift *s*, the implicit loop on line 4 that compares corresponding characters must execute *m* times to validate the shift.

As we shall see, NAIVE-STRING-MATCHER is not an optimal
procedure for this problem. Indeed, in this chapter we shall show an algorithm
with a worst-case running time of *O*(*n *+ *m*). The naive
string-matcher is inefficient because information gained about the text for one
value of *s* is totally ignored in considering other values of *s*.
Such information can be very valuable, however. For example, if *P* = aaab and we find that *s*
= 0 is valid, then none of the shifts 1, 2, or 3 are valid, since *T*[4] =
b. In the following sections, we examine several ways to make effective use of
this sort of information.

Show the comparisons the naive string matcher makes for the pattern *P* = 0001 in the text *T* = 000010001010001.

Show that the worst-case
time for the naive string matcher to find the *first* occurrence of a
pattern in a text is Θ((*n* - *m* + 1)(*m* - 1)).

Suppose that all
characters in the pattern *P* are different. Show how to accelerate NAIVE-STRING-MATCHER to run in time *O*(*n*)
on an *n*-character text *T*.

Suppose that pattern *P*
and text *T* are *randomly* chosen strings of length *m* and *n*,
respectively, from the *d*-ary alphabet Σ*_{d}* = {0, 1, . . . , *d* - 1}, where *d* ≥ 2. Show that the expected number of character-to-character comparisons made by the implicit loop in line 4 of the naive algorithm is at most 2(*n* - *m* + 1) over all executions of this loop.

(Assume that the naive algorithm stops comparing characters for a given shift once a mismatch is found or the entire pattern is matched.) Thus, for randomly chosen strings, the naive algorithm is quite efficient.

Suppose we allow the pattern *P* to contain occurrences of a **gap character** ◊ that can match an *arbitrary* string of characters (even one of zero length).

Note that the gap character
may occur an arbitrary number of times in the pattern but is assumed not to
occur at all in the text. Give a polynomial-time algorithm to determine if such
a pattern *P* occurs in a given text *T*, and analyze the running
time of your algorithm.

Rabin and Karp have proposed a string-matching
algorithm that performs well in practice and that also generalizes to other
algorithms for related problems, such as two-dimensional pattern matching. The
worst-case running time of the Rabin-Karp algorithm is *O*((*n* - *m*
+ 1)*m*), but it has a good average-case running time.

This algorithm makes use of elementary number-theoretic notions such as the equivalence of two numbers modulo a third number. You may want to refer to Section 33.1 for the relevant definitions.

For expository purposes, let us assume that Σ = {0, 1, . . . , 9}, so that each character is a decimal digit. (In the general case, we can assume that each character is a digit in radix-*d* notation, where *d* = |Σ|.) We can then view a string of *k* consecutive characters as representing a length-*k* decimal number. The character string 31415 thus corresponds to the decimal number 31,415. Given the dual interpretation of the input characters as both graphical symbols and digits, we find it convenient in this section to denote them as we would digits, in our standard text font.

Given a pattern *P*[1 . . *m*], we let *p* denote its corresponding decimal value. In a similar manner, given a text *T*[1 . . *n*], we let *t _{s}* denote the decimal value of the length-*m* substring *T*[*s* + 1 . . *s* + *m*], for *s* = 0, 1, . . . , *n* - *m*. Certainly, *t _{s}* = *p* if and only if *T*[*s* + 1 . . *s* + *m*] = *P*[1 . . *m*]; thus, *s* is a valid shift if and only if *t _{s}* = *p*.

We can compute *p* in time *O*(*m*) using Horner's rule (see Section 32.1):

*p* = *P*[*m*] + 10(*P*[*m* - 1] + 10(*P*[*m* - 2] + · · · + 10(*P*[2] + 10*P*[1]) · · ·)) .

The value *t*_{0}
can be similarly computed from *T*[1 . . *m*] in time O(*m*).
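Horner's rule amounts to one multiply-add per digit, scanning from the high-order end. A minimal Python sketch (our own helper, assuming decimal digits as in the running example):

```python
def horner_value(P: str) -> int:
    """Compute the decimal value p of digit string P[1..m] via Horner's rule:
    p = P[m] + 10*(P[m-1] + 10*(... + 10*P[1] ...)), using O(m) operations."""
    p = 0
    for ch in P:            # scan from the high-order digit
        p = 10 * p + int(ch)
    return p

print(horner_value("31415"))  # -> 31415
```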

To compute the remaining values *t*_{1}, *t*_{2}, . . . , *t _{n-m}* in time *O*(*n* - *m*), it suffices to observe that *t*_{*s*+1} can be computed from *t _{s}* in constant time, since

*t*_{*s*+1} = 10(*t _{s}* - 10^{*m*-1}*T*[*s* + 1]) + *T*[*s* + *m* + 1] .                      (1)

For example, if *m* = 5 and *t _{s}* = 31415, then we wish to remove the high-order digit *T*[*s* + 1] = 3 and bring in the new low-order digit (suppose it is *T*[*s* + 5 + 1] = 2) to obtain

*t*_{*s*+1} = 10(31415 - 10000 · 3) + 2 = 14152 .

Subtracting 10^{*m*-1}*T*[*s* + 1] removes the high-order digit from *t _{s}*, multiplying the result by 10 shifts the number left one digit position, and adding *T*[*s* + *m* + 1] brings in the appropriate low-order digit.
The only difficulty with this procedure is that *p* and *t _{s}* may be too large to work with conveniently. If *P* contains *m* characters, then assuming that each arithmetic operation on *p* (which is *m* digits long) takes 'constant time' is unreasonable. Fortunately, there is a simple cure: compute *p* and the *t _{s}*'s modulo a suitable modulus *q*, typically chosen as a prime such that 10*q* fits within one computer word, so that all the necessary computations can be performed with single-precision arithmetic.

In general, with a *d*-ary alphabet {0, 1, . . . , *d* - 1}, we choose *q* so that *dq* fits within a computer word and adjust the recurrence equation (1) to work modulo *q*, so that it becomes

*t*_{*s*+1} = (*d*(*t _{s}* - *T*[*s* + 1]*h*) + *T*[*s* + *m* + 1]) mod *q* ,

where *h* ≡ *d*^{*m*-1} (mod *q*) is the value of the digit '1' in the high-order position of an *m*-digit text window.

The following procedure makes these ideas precise. The inputs to the procedure are the text *T*, the pattern *P*, the radix *d* to use (which is typically taken to be |Σ|), and the prime *q* to use.

RABIN-KARP-MATCHER(*T*, *P*, *d*, *q*)

 1  *n* ← *length*[*T*]
 2  *m* ← *length*[*P*]
 3  *h* ← *d*^{*m*-1} mod *q*
 4  *p* ← 0
 5  *t*_{0} ← 0
 6  **for** *i* ← 1 **to** *m*
 7       **do** *p* ← (*dp* + *P*[*i*]) mod *q*
 8          *t*_{0} ← (*dt*_{0} + *T*[*i*]) mod *q*
 9  **for** *s* ← 0 **to** *n* - *m*
10       **do if** *p* = *t _{s}*
11             **then if** *P*[1 . . *m*] = *T*[*s* + 1 . . *s* + *m*]
12                     **then** print 'Pattern occurs with shift' *s*
13          **if** *s* < *n* - *m*
14             **then** *t*_{*s*+1} ← (*d*(*t _{s}* - *T*[*s* + 1]*h*) + *T*[*s* + *m* + 1]) mod *q*

The procedure RABIN-KARP-MATCHER works as follows. All characters are interpreted as radix-*d* digits. The subscripts on *t* are provided only for clarity; the program works correctly if all the subscripts are dropped. Line 3 initializes *h* to the value of the high-order digit position of an *m*-digit window. Lines 4-8 compute *p* as the value of *P*[1 . . *m*] mod *q* and *t*_{0} as the value of *T*[1 . . *m*] mod *q*. The **for** loop beginning on line 9 iterates through all possible shifts *s*. The loop has the following invariant: whenever line 10 is executed, *t _{s}* = *T*[*s* + 1 . . *s* + *m*] mod *q*.

The running time of RABIN-KARP-MATCHER is Θ((*n* - *m* + 1)*m*) in the worst case, since (like the naive string-matching algorithm) the Rabin-Karp algorithm explicitly verifies every valid shift. If *P* = a*^{m}* and *T* = a*^{n}*, then the verifications take time Θ((*n* - *m* + 1)*m*), since each of the *n* - *m* + 1 possible shifts is valid.

In many applications, we
expect few valid shifts (perhaps *O*(1) of them), and so the expected
running time of the algorithm is *O*(*n* + *m*) plus the time
required to process spurious hits. We can base a heuristic analysis on the assumption that reducing values modulo *q* acts like a random mapping from Σ* to **Z**_{*q*}. (See the discussion on the use of division for hashing in Section 12.3.1.) It is difficult to formalize and prove such an assumption, although one viable approach is to assume that *q* is chosen randomly from integers of the appropriate size. Under this assumption, we can expect that the number of spurious hits is *O*(*n*/*q*), since the chance that an arbitrary *t _{s}* will be equivalent to *p* modulo *q* can be estimated as 1/*q*. The expected matching time taken by the Rabin-Karp algorithm is then

*O*(*n*) + *O*(*m*(*v* + *n*/*q*)) ,

where *v* is the number of valid shifts. This running time is *O*(*n*) if we choose *q* ≥ *m*. That is, if the expected number of valid shifts is small (*O*(1)) and the prime *q* is chosen to be larger than the length of the pattern, then we can expect the Rabin-Karp procedure to run in time *O*(*n* + *m*).
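A runnable sketch of the rolling-hash idea in Python (function name, the default radix *d* = 256 for byte-like characters, and the small prime *q* = 101 are our illustrative choices):

```python
def rabin_karp_matcher(T: str, P: str, d: int = 256, q: int = 101) -> list:
    """Find all shifts of P in T using a rolling hash modulo the prime q."""
    n, m = len(T), len(P)
    if m > n:
        return []
    h = pow(d, m - 1, q)              # value of the high-order digit position, mod q
    p = t = 0
    for i in range(m):                # preprocessing: p = P mod q, t = t_0
        p = (d * p + ord(P[i])) % q
        t = (d * t + ord(T[i])) % q
    shifts = []
    for s in range(n - m + 1):
        if p == t and T[s:s + m] == P:    # explicit check rules out spurious hits
            shifts.append(s)
        if s < n - m:                      # recurrence (1), taken modulo q
            t = (d * (t - ord(T[s]) * h) + ord(T[s + m])) % q
    return shifts

print(rabin_karp_matcher("3141592653589793", "26"))  # -> [6]
```

Python's `%` always returns a nonnegative result, so the subtraction inside the recurrence needs no extra correction step.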

Working modulo *q* =
11, how many spurious hits does the Rabin-Karp matcher encounter in the text *T*
= 3141592653589793 when looking for the pattern *P* = 26?

How would you extend the
Rabin-Karp method to the problem of searching a text string for an occurrence
of any one of a given set of *k* patterns?

Show how to extend the
Rabin-Karp method to handle the problem of looking for a given *m* X *m*
pattern in an *n* X *n* array of characters. (The pattern may be
shifted vertically and horizontally, but it may not be rotated.)

Alice has a copy of a long *n*-bit file *A* = ⟨*a*_{*n*-1}, *a*_{*n*-2}, . . . , *a*_{0}⟩, and Bob similarly has an *n*-bit file *B* = ⟨*b*_{*n*-1}, *b*_{*n*-2}, . . . , *b*_{0}⟩. Alice and Bob wish to know if their files are identical. To avoid transmitting all of *A* or *B*, they use the following fast probabilistic check. Together, they select a prime *q* > 1000*n* and randomly select an integer *x* from {0, 1, . . . , *q* - 1}. Then, Alice evaluates

*A*(*x*) = (Σ_{*i*=0}^{*n*-1} *a _{i}x^{i}*) mod *q*

and Bob similarly evaluates *B*(*x*). Prove that if *A* ≠ *B*, there is at most one chance in 1000 that *A*(*x*) = *B*(*x*), whereas if the two files are the same, *A*(*x*) is necessarily the same as *B*(*x*). (*Hint*: See Exercise 33.4-4.)

We begin this section with the definition of a finite automaton. We then examine a special string-matching automaton and show how it can be used to find occurrences of a pattern in a text. This discussion includes details on how to simulate the behavior of a string-matching automaton on a given text. Finally, we shall show how to construct the string-matching automaton for a given input pattern.

A **finite automaton** *M* is a 5-tuple (*Q*, *q*_{0}, *A*, Σ, δ), where

*Q* is a finite set of **states**,

*q*_{0} ∈ *Q* is the **start state**,

*A* ⊆ *Q* is a distinguished set of **accepting states**,

Σ is a finite **input alphabet**,

δ is a function from *Q* × Σ into *Q*, called the **transition function** of *M*.

The finite automaton begins in state *q*_{0} and reads the characters of its input string one at a time. If the automaton is in state *q* and reads input character *a*, it moves ('makes a transition') from state *q* to state δ(*q*, *a*). Whenever its current state *q* is a member of *A*, the machine *M* is said to have **accepted** the string read so far. An input that is not accepted is said to be **rejected**.

A finite automaton *M* induces a function φ, called the **final-state function**, from Σ* to *Q* such that φ(*w*) is the state *M* ends up in after scanning the string *w*. Thus, *M* accepts a string *w* if and only if φ(*w*) ∈ *A*. The function φ is defined by the recursive relation

φ(ε) = *q*_{0} ,
φ(*wa*) = δ(φ(*w*), *a*)   for *w* ∈ Σ*, *a* ∈ Σ .

There is a string-matching automaton for every pattern
*P*; this automaton must be constructed from the pattern in a
preprocessing step before it can be used to search the text string. Figure 6
illustrates this construction for the pattern *P* = ababaca. From now on, we shall
assume that *P* is a given fixed pattern string; for brevity, we shall not
indicate the dependence upon *P* in our notation.

In order to specify the string-matching automaton corresponding to a given pattern *P*[1 . . *m*], we first define an auxiliary function σ, called the **suffix function** corresponding to *P*. The function σ maps Σ* to {0, 1, . . . , *m*} such that σ(*x*) is the length of the longest prefix of *P* that is a suffix of *x*:

σ(*x*) = max{*k* : *P _{k}* ⊐ *x*} .

The suffix function σ is well defined since the empty string *P*_{0} = ε is a suffix of every string. As examples, for the pattern *P* = ab, we have σ(ε) = 0, σ(ccaca) = 1, and σ(ccab) = 2. For a pattern *P* of length *m*, we have σ(*x*) = *m* if and only if *P* ⊐ *x*. It follows from the definition of the suffix function that if *x* ⊐ *y*, then σ(*x*) ≤ σ(*y*).

We define the string-matching automaton corresponding to a given pattern *P*[1 . . *m*] as follows.

The state set *Q* is {0, 1, . . . , *m*}. The start state *q*_{0} is state 0, and state *m* is the only accepting state.

The transition function δ is defined by the following equation, for any state *q* and character *a*:

δ(*q*, *a*) = σ(*P _{q}a*) .

Here is an intuitive rationale for defining δ(*q*, *a*) = σ(*P _{q}a*). The machine maintains as an invariant of its operation that

φ(*T _{i}*) = σ(*T _{i}*) ;

this result is proved as Theorem 4 below. In words, this means that after scanning the first *i* characters of the text string *T*, the machine is in state σ(*T _{i}*), which is the length of the longest suffix of *T _{i}* that is also a prefix of the pattern *P*.

In the string-matching automaton of Figure 6, for example, we have δ(5, b) = 4. This follows from the fact that if the automaton reads a b in state *q* = 5, then *P _{q}*b = ababab, and the longest prefix of *P* that is also a suffix of ababab is *P*_{4} = abab.

To clarify the operation of a string-matching automaton, we now give a simple, efficient program for simulating the behavior of such an automaton (represented by its transition function δ) in finding occurrences of a pattern *P* of length *m* in an input text *T*[1 . . *n*]. As for any string-matching automaton for a pattern of length *m*, the state set *Q* is {0, 1, . . . , *m*}, the start state is 0, and the only accepting state is state *m*.

FINITE-AUTOMATON-MATCHER(*T*, δ, *m*)

1  *n* ← *length*[*T*]
2  *q* ← 0
3  **for** *i* ← 1 **to** *n*
4       **do** *q* ← δ(*q*, *T*[*i*])
5          **if** *q* = *m*
6             **then** print 'Pattern occurs with shift' *i* - *m*

The simple loop structure of FINITE-AUTOMATON-MATCHER implies that its running time on a text string of length *n* is *O*(*n*). This running time, however, does not include the time required to compute the transition function δ. We address this problem later, after proving that the procedure FINITE-AUTOMATON-MATCHER operates correctly.
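The simulation loop can be sketched in Python. Here δ is represented as a list of per-state dictionaries, and the small hand-built table for the two-character pattern "ab" over {a, b} is our own illustrative example:

```python
def finite_automaton_matcher(T: str, delta: list, m: int) -> list:
    """O(n) simulation: one transition per text character; an occurrence
    ends at text position i whenever the accepting state m is reached."""
    q = 0                                # start state
    shifts = []
    for i, ch in enumerate(T):           # ch is T[i+1] in 1-based terms
        q = delta[q].get(ch, 0)          # q <- delta(q, T[i]); unknown chars reset to 0
        if q == m:
            shifts.append(i + 1 - m)     # pattern occurs with shift i - m
    return shifts

# Hand-built transition function delta(q, a) = sigma(P_q a) for P = "ab":
delta = [{'a': 1, 'b': 0},   # state 0: nothing matched yet
         {'a': 1, 'b': 2},   # state 1: prefix "a" recognized
         {'a': 1, 'b': 0}]   # state 2: accepting, all of "ab" seen
print(finite_automaton_matcher("abcabab", delta, 2))  # -> [0, 3, 5]
```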

Consider the operation of the automaton on an input text *T*[1 . . *n*]. We shall prove that the automaton is in state σ(*T _{i}*) after scanning character *T*[*i*]. Since σ(*T _{i}*) = *m* if and only if *P* ⊐ *T _{i}*, the machine is in the accepting state *m* if and only if the pattern *P* has just been scanned. To prove this result, we use the following two lemmas about the suffix function σ.

Lemma 2

For any string *x* and character *a*, we have σ(*xa*) ≤ σ(*x*) + 1.

**Proof** Referring to Figure 7, let *r* = σ(*xa*). If *r* = 0, then the conclusion is trivially satisfied, by the nonnegativity of σ(*x*). So assume that *r* > 0. Then, *P _{r}* ⊐ *xa*, by the definition of σ. Thus, *P*_{*r*-1} ⊐ *x*, by dropping the *a* from the end of *P _{r}* and from the end of *xa*. Therefore, *r* - 1 ≤ σ(*x*), since σ(*x*) is the largest *k* such that *P _{k}* ⊐ *x*.

Lemma 3

For any string *x* and character *a*, if *q* = σ(*x*), then σ(*xa*) = σ(*P _{q}a*).

**Proof** From the definition of σ, we have *P _{q}* ⊐ *x*. As Figure 8 shows, *P _{q}a* ⊐ *xa*. If we let *r* = σ(*xa*), then *P _{r}* ⊐ *xa* and, by Lemma 2, *r* ≤ *q* + 1 = |*P _{q}a*|. Since *P _{q}a* ⊐ *xa*, *P _{r}* ⊐ *xa*, and |*P _{r}*| ≤ |*P _{q}a*|, Lemma 1 implies that *P _{r}* ⊐ *P _{q}a*. Therefore, *r* ≤ σ(*P _{q}a*), that is, σ(*xa*) ≤ σ(*P _{q}a*). But we also have σ(*P _{q}a*) ≤ σ(*xa*), since *P _{q}a* ⊐ *xa*. Thus, σ(*xa*) = σ(*P _{q}a*).

We are now ready to prove our main theorem characterizing the behavior of a string-matching automaton on a given input text. As noted above, this theorem shows that the automaton is merely keeping track, at each step, of the longest prefix of the pattern that is a suffix of what has been read so far.

Theorem 4

If φ is the final-state function of a string-matching automaton for a given pattern *P* and *T*[1 . . *n*] is an input text for the automaton, then

φ(*T _{i}*) = σ(*T _{i}*)

for *i* = 0, 1, . . . , *n*.

**Proof** The proof is by induction on *i*. For *i* = 0, the theorem is trivially true, since *T*_{0} = ε and thus φ(*T*_{0}) = 0 = σ(*T*_{0}).

Now, we assume that φ(*T _{i}*) = σ(*T _{i}*) and prove that φ(*T*_{*i*+1}) = σ(*T*_{*i*+1}). Let *q* denote φ(*T _{i}*), and let *a* denote *T*[*i* + 1]. Then,

φ(*T*_{*i*+1}) = φ(*T _{i}a*)
             = δ(φ(*T _{i}*), *a*)   (by the definition of φ)
             = δ(*q*, *a*)          (by the definition of *q*)
             = σ(*P _{q}a*)         (by the definition of δ)
             = σ(*T _{i}a*)         (by Lemma 3 and the inductive hypothesis)
             = σ(*T*_{*i*+1}) .

By induction, the theorem is proved.

By Theorem 4, if the machine enters state *q* on line 4, then *q* is the largest value such that *P _{q}* ⊐ *T _{i}*. Thus, we have *q* = *m* on line 5 if and only if an occurrence of the pattern *P* has just been scanned. We conclude that FINITE-AUTOMATON-MATCHER operates correctly.

The following procedure computes the transition function δ from a given pattern *P*[1 . . *m*].

COMPUTE-TRANSITION-FUNCTION(*P*, Σ)

1  *m* ← *length*[*P*]
2  **for** *q* ← 0 **to** *m*
3       **do for** each character *a* ∈ Σ
4            **do** *k* ← min(*m* + 1, *q* + 2)
5               **repeat** *k* ← *k* - 1
6                 **until** *P _{k}* ⊐ *P _{q}a*
7               δ(*q*, *a*) ← *k*
8  **return** δ

This procedure computes δ(*q*, *a*) in a straightforward manner according to its definition. The nested loops beginning on lines 2 and 3 consider all states *q* and characters *a*, and lines 4-7 set δ(*q*, *a*) to be the largest *k* such that *P _{k}* ⊐ *P _{q}a*. The code starts with the largest conceivable value of *k*, which is min(*m*, *q* + 1), and decreases *k* until *P _{k}* ⊐ *P _{q}a*.

The running time of COMPUTE-TRANSITION-FUNCTION is *O*(*m*^{3}|Σ|), because the outer loops contribute a factor of *m*|Σ|, the inner **repeat** loop can run at most *m* + 1 times, and the test *P _{k}* ⊐ *P _{q}a* on line 6 can require comparing up to *m* characters. Much faster procedures exist; the time required to compute δ from *P* can be improved to *O*(*m*|Σ|) by utilizing some cleverly computed information about the pattern *P* (see Exercise 4-6). With this improved procedure for computing δ, the total running time to find all occurrences of a length-*m* pattern in a length-*n* text over an alphabet Σ is *O*(*n* + *m*|Σ|).
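The straightforward construction can be sketched directly from the definition δ(*q*, *a*) = σ(*P _{q}a*); the helper below (our own, mirroring the *O*(*m*^{3}|Σ|) procedure rather than the faster variants) pairs with the simulation sketch given earlier:

```python
def compute_transition_function(P: str, alphabet: str) -> list:
    """Build delta(q, a) = sigma(P_q a): the length of the longest prefix of P
    that is a suffix of P[:q] + a. Runs in O(m^3 |alphabet|), per the text."""
    m = len(P)
    delta = []
    for q in range(m + 1):
        row = {}
        for a in alphabet:
            window = P[:q] + a                    # the string P_q a, length q + 1
            k = min(m, q + 1)                     # largest conceivable value of k
            while k > 0 and P[:k] != window[q + 1 - k:]:
                k -= 1                            # decrease k until P_k suffixes P_q a
            row[a] = k
        delta.append(row)
    return delta

print(compute_transition_function("ab", "ab"))
# -> [{'a': 1, 'b': 0}, {'a': 1, 'b': 2}, {'a': 1, 'b': 0}]
```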

Construct the string-matching
automaton for the pattern *P* = aabab and illustrate its operation on the text string *T* = aaababaabaababaab

Draw a state-transition diagram for a string-matching automaton for the pattern ababbabbababbababbabb over the alphabet Σ = {a, b}.

We call a pattern *P* **nonoverlappable** if *P _{k}* ⊐ *P _{q}* implies *k* = 0 or *k* = *q*.

Given two patterns *P* and *P*',
describe how to construct a finite automaton that determines all occurrences of
*either* pattern. Try to minimize the number of states in your automaton.

Given a pattern *P* containing gap
characters (see Exercise 1-5), show how to build a finite automaton that can
find an occurrence of *P* in a text *T* in *O*(*n*) time, where *n* = |*T*|.

We now present a linear-time string-matching algorithm due to Knuth, Morris, and Pratt. Their algorithm achieves a Θ(*n* + *m*) running time by avoiding the computation of the transition function δ altogether, and it does the pattern matching using just an auxiliary function π[1 . . *m*] precomputed from the pattern in time *O*(*m*). The array π allows the transition function δ to be computed efficiently (in an amortized sense) 'on the fly' as needed. Roughly speaking, for any state *q* = 0, 1, . . . , *m* and any character *a* ∈ Σ, the value π[*q*] contains the information that is independent of *a* and is needed to compute δ(*q*, *a*). (This remark will be clarified shortly.) Since the array π has only *m* entries, whereas δ has *O*(*m*|Σ|) entries, we save a factor of |Σ| in the preprocessing time by computing π rather than δ.

The prefix function π for a pattern encapsulates knowledge about how the pattern matches against shifts of itself. This information can be used to avoid testing useless shifts in the naive pattern-matching algorithm or to avoid the precomputation of δ for a string-matching automaton.

Consider the operation of
the naive string matcher. Figure 9(a) shows a particular shift *s* of a template
containing the pattern *P* = ababaca against a text *T*. For this
example, *q* = 5 of the characters have matched successfully, but the 6th
pattern character fails to match the corresponding text character. The
information that *q* characters have matched successfully determines the
corresponding text characters. Knowing these *q* text characters allows us
to determine immediately that certain shifts are invalid. In the example of the
figure, the shift *s* + 1 is necessarily invalid, since the first pattern
character, an a, would be aligned with a text character that is known to match
with the second pattern character, a b. The shift *s* + 2 shown in part
(b) of the figure, however, aligns the first three pattern characters with
three text characters that must necessarily match. In general, it is useful to
know the answer to the following question:

Given that pattern characters *P*[1 . . *q*] match text characters *T*[*s* + 1 . . *s* + *q*], what is the least shift *s*' > *s* such that

*P*[1 . . *k*] = *T*[*s*' + 1 . . *s*' + *k*] ,                      (5)

where *s*' + *k* = *s* + *q*?

Such a shift *s*' is
the first shift greater than *s* that is not necessarily invalid due to
our knowledge of *T*[*s* + 1 . . *s* + *q*]. In the best
case, we have that *s'* = *s* + *q*, and shifts *s* + 1, *s*
+ 2, . . . , *s* + *q* - 1 are all immediately ruled out. In any
case, at the new shift *s*'* *we don't need to compare the first *k*
characters of *P* with the corresponding characters of *T*, since we
are guaranteed that they match by equation (5).

The necessary information can be precomputed by comparing the pattern against itself, as illustrated in Figure 9(c). Since *T*[*s*' + 1 . . *s*' + *k*] is part of the known portion of the text, it is a suffix of the string *P _{q}*. Equation (5) can therefore be interpreted as asking for the largest *k* < *q* such that *P _{k}* ⊐ *P _{q}*. Then, *s*' = *s* + (*q* - *k*) is the next potentially valid shift.

We formalize the precomputation required as follows. Given a pattern *P*[1 . . *m*], the **prefix function** for the pattern *P* is the function π : {1, 2, . . . , *m*} → {0, 1, . . . , *m* - 1} such that

π[*q*] = max{*k* : *k* < *q* and *P _{k}* ⊐ *P _{q}*} .

That is, π[*q*] is the length of the longest prefix of *P* that is a proper suffix of *P _{q}*. As another example, Figure 10(a) gives the complete prefix function π for the pattern ababababca.

The Knuth-Morris-Pratt matching algorithm is given in pseudocode below as the procedure KMP-MATCHER. It is mostly modeled after FINITE-AUTOMATON-MATCHER, as we shall see. KMP-MATCHER calls the auxiliary procedure COMPUTE-PREFIX-FUNCTION to compute π.

KMP-MATCHER(*T*, *P*)

 1  *n* ← *length*[*T*]
 2  *m* ← *length*[*P*]
 3  π ← COMPUTE-PREFIX-FUNCTION(*P*)
 4  *q* ← 0
 5  **for** *i* ← 1 **to** *n*
 6       **do while** *q* > 0 and *P*[*q* + 1] ≠ *T*[*i*]
 7            **do** *q* ← π[*q*]
 8          **if** *P*[*q* + 1] = *T*[*i*]
 9             **then** *q* ← *q* + 1
10          **if** *q* = *m*
11             **then** print 'Pattern occurs with shift' *i* - *m*
12                  *q* ← π[*q*]

COMPUTE-PREFIX-FUNCTION(*P*)

 1  *m* ← *length*[*P*]
 2  π[1] ← 0
 3  *k* ← 0
 4  **for** *q* ← 2 **to** *m*
 5       **do while** *k* > 0 and *P*[*k* + 1] ≠ *P*[*q*]
 6            **do** *k* ← π[*k*]
 7          **if** *P*[*k* + 1] = *P*[*q*]
 8             **then** *k* ← *k* + 1
 9          π[*q*] ← *k*
10  **return** π

We begin with an analysis of the running times of these procedures. Proving these procedures correct will be more complicated.

The running time of COMPUTE-PREFIX-FUNCTION is *O*(*m*), using an amortized analysis (see Chapter 18). We associate a potential of *k* with the current state *k* of the algorithm. This potential has an initial value of 0, by line 3. Line 6 decreases *k* whenever it is executed, since π[*k*] < *k*. Since π[*k*] ≥ 0 for all *k*, however, *k* can never become negative. The only other line that affects *k* is line 8, which increases *k* by at most one during each execution of the **for** loop body. Since *k* < *q* upon entering the **for** loop, and since *q* is incremented in each iteration of the **for** loop body, *k* < *q* always holds. (This justifies the claim that π[*q*] < *q* as well, by line 9.) We can pay for each execution of the **while** loop body on line 6 with the corresponding decrease in the potential function, since π[*k*] < *k*. Line 8 increases the potential function by at most one, so that the amortized cost of the loop body on lines 5-9 is *O*(1). Since the number of outer-loop iterations is *O*(*m*), and since the final potential function is at least as great as the initial potential function, the total actual worst-case running time of COMPUTE-PREFIX-FUNCTION is *O*(*m*).

The Knuth-Morris-Pratt algorithm runs in time *O*(*m* + *n*). The call of COMPUTE-PREFIX-FUNCTION takes *O*(*m*) time as we have just seen, and a similar amortized analysis, using the value of *q* as the potential function, shows that the remainder of KMP-MATCHER takes *O*(*n*) time.

Compared to FINITE-AUTOMATON-MATCHER, by using π rather than δ, we have reduced the time for preprocessing the pattern from *O*(*m*|Σ|) to *O*(*m*), while keeping the actual matching time bounded by *O*(*m* + *n*).

We start with an essential lemma showing that by iterating the prefix function π, we can enumerate all the prefixes *P _{k}* that are suffixes of a given prefix *P _{q}*. Let

π*[*q*] = {π[*q*], π^{(2)}[*q*], π^{(3)}[*q*], . . . , π^{(*t*)}[*q*]} ,

where π^{(*i*)}[*q*] is defined in terms of functional iteration, so that π^{(1)}[*q*] = π[*q*] and π^{(*i*)}[*q*] = π[π^{(*i*-1)}[*q*]] for *i* > 1, and where the sequence in π*[*q*] stops upon reaching π^{(*t*)}[*q*] = 0.

Lemma 5

Let *P* be a pattern of length *m* with prefix function π. Then, for *q* = 1, 2, . . . , *m*, we have π*[*q*] = {*k* : *k* < *q* and *P _{k}* ⊐ *P _{q}*}.

**Proof** We first prove that *i* ∈ π*[*q*] implies *P _{i}* ⊐ *P _{q}*. If *i* ∈ π*[*q*], then *i* = π^{(*u*)}[*q*] for some *u* ≥ 1. For *u* = 1 we have *i* = π[*q*], so *i* < *q* and *P _{i}* ⊐ *P _{q}* by the definition of π; repeated application of the relations π[*i*] < *i* and *P*_{π[*i*]} ⊐ *P _{i}*, together with the transitivity of ⊐, establishes the claim for all *i* ∈ π*[*q*]. Therefore, π*[*q*] ⊆ {*k* : *k* < *q* and *P _{k}* ⊐ *P _{q}*}.

We prove that {*k* : *k* < *q* and *P _{k}* ⊐ *P _{q}*} ⊆ π*[*q*] by contradiction. Suppose to the contrary that there is an integer in the set {*k* : *k* < *q* and *P _{k}* ⊐ *P _{q}*} - π*[*q*], and let *j* be the largest such value. We have *j* < *q*, and so we let *j*' denote the smallest integer in π*[*q*] ∪ {*q*} that is greater than *j*. (We can choose *j*' = *q* if there is no number in π*[*q*] that is greater than *j*.) We have *P _{j}* ⊐ *P _{q}* because *j* is in the set, and *P _{j'}* ⊐ *P _{q}* because *j*' ∈ π*[*q*] ∪ {*q*}; thus, *P _{j}* ⊐ *P _{j'}* by Lemma 1. Moreover, *j* is the largest value less than *j*' with this property. Therefore, we must have π[*j*'] = *j* and thus *j* ∈ π*[*q*]. This contradiction proves the lemma.

Figure 10 illustrates this lemma.

The algorithm COMPUTE-PREFIX-FUNCTION computes π[*q*] in order for *q* = 1, 2, . . . , *m*. The computation of π[1] = 0 in line 2 of COMPUTE-PREFIX-FUNCTION is certainly correct, since π[*q*] < *q* for all *q*. The following lemma and its corollary will be used to prove that COMPUTE-PREFIX-FUNCTION computes π[*q*] correctly for *q* > 1.

Lemma 6

Let *P* be a pattern of length *m*, and let π be the prefix function for *P*. For *q* = 1, 2, . . . , *m*, if π[*q*] > 0, then π[*q*] - 1 ∈ π*[*q* - 1].

**Proof** If *r* = π[*q*] > 0, then *r* < *q* and *P _{r}* ⊐ *P _{q}*; thus, *r* - 1 < *q* - 1 and *P*_{*r*-1} ⊐ *P*_{*q*-1} (by dropping the last character from *P _{r}* and from *P _{q}*). By Lemma 5, therefore, π[*q*] - 1 = *r* - 1 ∈ π*[*q* - 1].

For *q* = 2, 3, . . . , *m*, define the subset *E*_{*q*-1} ⊆ π*[*q* - 1] by

*E*_{*q*-1} = {*k* ∈ π*[*q* - 1] : *P*[*k* + 1] = *P*[*q*]} .

The set *E*_{*q*-1} consists of those values *k* for which *P _{k}* ⊐ *P*_{*q*-1} and for which, because *P*[*k* + 1] = *P*[*q*], the prefix *P _{k}* can be extended to *P*_{*k*+1} to give a suffix of *P _{q}*.

Corollary 7

Let *P* be a pattern of length *m*, and let π be the prefix function for *P*. For *q* = 2, 3, . . . , *m*,

π[*q*] = 0                              if *E*_{*q*-1} = Ø ,
π[*q*] = 1 + max{*k* ∈ *E*_{*q*-1}}     if *E*_{*q*-1} ≠ Ø .

**Proof** If *r* = π[*q*] > 0, then *P _{r}* ⊐ *P _{q}* and *P*[*r*] = *P*[*q*], so by Lemma 6,

*r* = 1 + max{*k* ∈ π*[*q* - 1] : *P*[*k* + 1] = *P*[*q*]} .

But the set maximized over is just *E*_{*q*-1}, so that π[*q*] = 1 + max{*k* ∈ *E*_{*q*-1}}, and *E*_{*q*-1} is nonempty. If *r* = 0, then there is no *k* for which *P _{k}* can be extended to *P*_{*k*+1} to give a suffix of *P _{q}*, since then we would have π[*q*] > 0; thus, *E*_{*q*-1} = Ø.

We now finish the proof that COMPUTE-PREFIX-FUNCTION computes π correctly. In the procedure COMPUTE-PREFIX-FUNCTION, at the start of each iteration of the **for** loop of lines 4-9, we have that *k* = π[*q* - 1]. This condition is enforced by lines 2 and 3 when the loop is first entered, and it remains true in each successive iteration because of line 9. Lines 5-8 adjust *k* so that it now becomes the correct value of π[*q*]. The loop on lines 5-6 searches through all values *k* ∈ π*[*q* - 1] until one is found for which *P*[*k* + 1] = *P*[*q*]; at that point, *k* is the largest value in the set *E*_{*q*-1}, so that, by Corollary 7, we can set π[*q*] to *k* + 1. If no such *k* is found, the loop leaves *k* = 0 at line 7, and line 9 correctly sets π[*q*] = 0.

The procedure KMP-MATCHER can be viewed as a reimplementation of the procedure FINITE-AUTOMATON-MATCHER. Specifically, we shall prove that the code on lines 6-9 of KMP-MATCHER is equivalent to line 4 of FINITE-AUTOMATON-MATCHER, which sets *q* to δ(*q*, *T*[*i*]). Instead of using a stored value of δ(*q*, *T*[*i*]), however, this value is recomputed as necessary from π. Once we have argued that KMP-MATCHER simulates the behavior of FINITE-AUTOMATON-MATCHER, the correctness of KMP-MATCHER follows from the correctness of FINITE-AUTOMATON-MATCHER (though we shall see in a moment why line 12 in KMP-MATCHER is necessary).

The correctness of KMP-MATCHER follows from the claim that either δ(*q*, *T*[*i*]) = 0 or else δ(*q*, *T*[*i*]) - 1 ∈ π*[*q*]. To check this claim, let *k* = δ(*q*, *T*[*i*]). Then, *P _{k}* ⊐ *P _{q}T*[*i*] by the definitions of δ and σ. Therefore, either *k* = 0 or else *k* ≥ 1 and *P*_{*k*-1} ⊐ *P _{q}* by dropping the last character from both *P _{k}* and *P _{q}T*[*i*]. Thus, either *k* = 0 or *k* - 1 ∈ π*[*q*] by Lemma 5, which proves the claim.

The claim is used as follows. Let *q*' denote the value of *q* when line 6 is entered. We use the equivalence π*[*q*'] = {*k* : *k* < *q*' and *P _{k}* ⊐ *P _{q'}*} to justify the iteration *q* ← π[*q*] that enumerates the elements of {*k* : *P _{k}* ⊐ *P _{q'}*}. Lines 6-9 determine δ(*q*', *T*[*i*]) by examining the elements of π*[*q*'] in decreasing order. The code uses the claim to begin with *q* = φ(*T*_{*i*-1}) = σ(*T*_{*i*-1}), and the loop terminates either upon finding a *k* ∈ π*[*q*'] ∪ {*q*'} with *P*[*k* + 1] = *T*[*i*] or upon reaching *q* = 0; in either case, lines 8-9 then leave *q* = σ(*T _{i}*).

Line 12 is necessary in KMP-MATCHER to avoid a possible reference to *P*[*m* + 1] on line 6 after an occurrence of *P* has been found. (The argument that *q* = σ(*T*_{*i*-1}) upon the next execution of line 6 remains valid by the hint given in Exercise 4-6: δ(*m*, *a*) = δ(π[*m*], *a*), or, equivalently, σ(*Pa*) = σ(*P*_{π[*m*]}*a*), for any *a* ∈ Σ.)

Compute the prefix function π for the pattern ababbabbababbababbabb when the alphabet is Σ = {a, b}.

Give an upper bound on the size of π*[*q*] as a function of *q*. Give an example to show that your bound is tight.

Explain how to determine the occurrences of pattern *P* in the text *T* by examining the π function for the string *PT* (the string of length *m* + *n* that is the concatenation of *P* and *T*).

Show how to improve KMP-MATCHER by replacing the occurrence of π in line 7 (but not line 12) by π', where π' is defined recursively for *q* = 1, 2, . . . , *m* by the equation

π'[*q*] = 0                  if π[*q*] = 0 ,
π'[*q*] = π'[π[*q*]]         if π[*q*] ≠ 0 and *P*[π[*q*] + 1] = *P*[*q* + 1] ,
π'[*q*] = π[*q*]             if π[*q*] ≠ 0 and *P*[π[*q*] + 1] ≠ *P*[*q* + 1] .
Explain why the modified algorithm is correct, and explain in what sense this modification constitutes an improvement.

Give a linear-time algorithm to determine if a text *T*
is a cyclic rotation of another string *T'*. For example, arc and car are cyclic
rotations of each other.

If the pattern *P* is relatively long and the alphabet Σ is reasonably large, then an algorithm due to Robert S. Boyer and J. Strother Moore is likely to be the most efficient string-matching algorithm.

BOYER-MOORE-MATCHER(*T*, *P*, Σ)

 1  *n* ← *length*[*T*]
 2  *m* ← *length*[*P*]
 3  λ ← COMPUTE-LAST-OCCURRENCE-FUNCTION(*P*, *m*, Σ)
 4  γ ← COMPUTE-GOOD-SUFFIX-FUNCTION(*P*, *m*)
 5  *s* ← 0
 6  **while** *s* ≤ *n* - *m*
 7       **do** *j* ← *m*
 8          **while** *j* > 0 and *P*[*j*] = *T*[*s* + *j*]
 9            **do** *j* ← *j* - 1
10          **if** *j* = 0
11             **then** print 'Pattern occurs at shift' *s*
12                  *s* ← *s* + γ[0]
13             **else** *s* ← *s* + max(γ[*j*], *j* - λ[*T*[*s* + *j*]])

Aside from the mysterious-looking λ's and γ's, this program looks remarkably like the naive string-matching algorithm. Indeed, suppose we comment out lines 3-4 and replace the updating of *s* on lines 12-13 with simple incrementations as follows:

12                  *s* ← *s* + 1
13             **else** *s* ← *s* + 1

The modified program now
acts exactly like the naive string matcher: the **while** loop beginning on
line 6 considers each of the *n - m* + 1 possible shifts *s* in turn,
and the **while** loop beginning on line 8 tests the condition *P*[1 .
. *m*] = *T*[*s* + 1 . . *s* + *m*] by comparing *P*[*j*]
with *T*[*s *+ *j*] for *j* = *m*, *m* - 1, . . .
, 1. If the loop terminates with *j* = 0, a valid shift *s* has been
found, and line 11 prints out the value of *s*. At this level, the only
remarkable features of the Boyer-Moore algorithm are that it compares the
pattern against the text *from right to left* and that it increases the
shift *s* on lines 12-13 by a value that is not necessarily 1.

The Boyer-Moore algorithm
incorporates two heuristics that allow it to avoid much of the work that our
previous string-matching algorithms performed. These heuristics are so
effective that they often allow the algorithm to skip altogether the
examination of many text characters. These heuristics, known as the
'bad-character heuristic' and the 'good-suffix heuristic,'
are illustrated in Figure 11. They can be viewed as operating independently in
parallel. When a mismatch occurs, each heuristic proposes an amount by which *s*
can safely be increased without missing a valid shift. The Boyer-Moore
algorithm chooses the larger amount and increases *s* by that amount: when
line 13 is reached after a mismatch, the bad-character heuristic proposes increasing *s* by *j* - λ[*T*[*s* + *j*]], and the good-suffix heuristic proposes increasing *s* by γ[*j*].

When a mismatch occurs, the bad-character
heuristic uses information about where the bad text character *T*[*s*
+ *j*] occurs in the pattern (if it occurs at all) to propose a new shift.
In the best case, the mismatch occurs on the first comparison (*P*[*m*] ≠ *T*[*s* + *m*]) and the bad character *T*[*s* + *m*] does not occur in the pattern at all. (Imagine searching for a*^{m}* in the text string b*^{n}*.) In this case, we can increase the shift *s* by *m*, since any shift smaller than *s* + *m* would align some pattern character against the bad character, causing a mismatch. If the best case occurs repeatedly, the Boyer-Moore algorithm examines only a fraction 1/*m* of the text characters.

In general, the **bad-character heuristic** works as follows. Suppose we have just found a mismatch: *P*[*j*] ≠ *T*[*s* + *j*] for some *j*, where 1 ≤ *j* ≤ *m*. We then let *k* be the largest index in the range 1 ≤ *k* ≤ *m* such that *T*[*s* + *j*] = *P*[*k*], if any such *k* exists; otherwise, we let *k* = 0. We claim that we may safely increase *s* by *j* - *k*, considering three cases, as illustrated by Figure 12.

*k* =
0: As shown in Figure 12(a), the bad character *T*[*s* + *j*]
didn't occur in the pattern at all, and so we can safely increase *s* by *j*
without missing any valid shifts.

*k*
< *j*: As shown in Figure 12(b), the rightmost occurrence of the bad
character is in the pattern to the left of position *j*, so that *j*
- *k* > 0 and the pattern must be moved *j - k* characters to the
right before the bad text character matches any pattern character. Therefore,
we can safely increase *s* by *j - k* without missing any valid
shifts.

*k*
> *j*: As shown in Figure 12(c), *j - k* < 0, and so the
bad-character heuristic is essentially proposing to decrease *s*. This
recommendation will be ignored by the Boyer-Moore algorithm, since the
good-suffix heuristic will propose a shift to the right in all cases.

The following simple program defines λ[*a*] to be the index of the rightmost position in the pattern at which character *a* occurs, for each *a* ∈ Σ. If *a* does not occur in the pattern, then λ[*a*] is set to 0. We call λ the **last-occurrence function** for the pattern. With this definition, the expression *j* - λ[*T*[*s* + *j*]] on line 13 of BOYER-MOORE-MATCHER implements the bad-character heuristic in all three cases.

COMPUTE-LAST-OCCURRENCE-FUNCTION(*P*, *m*, Σ)

1  **for** each character *a* ∈ Σ
2       **do** λ[*a*] ← 0
3  **for** *j* ← 1 **to** *m*
4       **do** λ[*P*[*j*]] ← *j*
5  **return** λ

The running time of procedure COMPUTE-LAST-OCCURRENCE-FUNCTION is *O*(|Σ| + *m*).
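The last-occurrence function and the bad-character heuristic can be sketched on their own in Python (our own simplification of the full Boyer-Moore algorithm: since the good-suffix function is omitted here, the proposed shift must be clamped to at least 1 to guarantee progress):

```python
def compute_last_occurrence(P: str) -> dict:
    """lam[a] = rightmost 1-based position of a in P; absent characters map to 0."""
    return {ch: j + 1 for j, ch in enumerate(P)}  # later positions overwrite earlier

def bad_character_matcher(T: str, P: str) -> list:
    """Right-to-left scanning with only the bad-character heuristic."""
    n, m = len(T), len(P)
    lam = compute_last_occurrence(P)
    shifts, s = [], 0
    while s <= n - m:
        j = m
        while j > 0 and P[j - 1] == T[s + j - 1]:   # compare right to left
            j -= 1
        if j == 0:
            shifts.append(s)        # valid shift found
            s += 1                  # full Boyer-Moore would advance by gamma[0]
        else:
            # propose j - lam[T[s+j]]; clamp to 1 since the heuristic alone
            # may suggest a non-positive increase (case k > j in the text)
            s += max(1, j - lam.get(T[s + j - 1], 0))
    return shifts

print(bad_character_matcher("here is a simple example", "example"))  # -> [17]
```

When the bad text character does not occur in the pattern at all, `lam.get(...)` returns 0 and the shift jumps by the full mismatch position *j*, which is what makes the method sublinear on typical inputs.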

Let us define the relation *Q* ~ *R* (read '*Q* is similar to *R*') for strings *Q* and *R* to mean that *Q* ⊐ *R* or *R* ⊐ *Q*. If two strings are similar, then we can align them with their rightmost characters matched, and no pair of aligned characters will disagree. The relation '~' is symmetric: *Q* ~ *R* if and only if *R* ~ *Q*. We also have, as a consequence of Lemma 1, that *Q* ⊐ *S* and *R* ⊐ *S* imply *Q* ~ *R*.

If we find that *P*[*j*] ≠ *T*[*s* + *j*], where *j* < *m*, then the **good-suffix heuristic** says that we can safely advance *s* by

γ[*j*] = *m* - max{*k* : 0 ≤ *k* < *m* and *P*[*j* + 1 . . *m*] ~ *P _{k}*} .

That is, γ[*j*] is the least amount we can advance *s* without causing any characters in the 'good suffix' *T*[*s* + *j* + 1 . . *s* + *m*] to be mismatched against the new alignment of the pattern. The function γ is well defined for all *j*, since *P*[*j* + 1 . . *m*] ~ *P*_{0} = ε for all *j*. We call γ the **good-suffix function** for the pattern *P*.

We now show how to compute the good-suffix function γ. We first observe that γ[*j*] ≤ *m* - π[*m*] for all *j*, as follows. If *w* = π[*m*], then *P _{w}* ⊐ *P* by the definition of π. Furthermore, since *P*[*j* + 1 . . *m*] ⊐ *P* for any *j*, we have *P _{w}* ~ *P*[*j* + 1 . . *m*] by the consequence of Lemma 1 noted above, and so *w* is a candidate in the maximization defining γ[*j*].

We can now rewrite our definition of γ as

γ[*j*] = *m* - max({π[*m*]} ∪ {*k* : *k* < *m* and *P*[*j* + 1 . . *m*] ⊐ *P _{k}*}) .

[The condition that *P*[*j* + 1 . . *m*] ~ *P _{k}* holds if either *P*[*j* + 1 . . *m*] ⊐ *P _{k}* or *P _{k}* ⊐ *P*[*j* + 1 . . *m*]. But the latter possibility implies that *P _{k}* ⊐ *P* and thus that *k* ≤ π[*m*], by the definition of π.]

(The second set may be empty.) It is worth observing that the definition implies that γ[*j*] > 0 for all *j* = 1, 2, . . . , *m*, which ensures that the Boyer-Moore algorithm makes progress.

To simplify the
expression for γ further, we
define *P*' as the reverse of the pattern *P* and π' as the
corresponding prefix function. That is, *P'*[*i*] = *P*[*m*
- *i* + 1] for *i* = 1, 2, . . . , *m*, and π'[*t*]
is the largest *u* such that *u* < *t* and *P' _{u}* ⊐ *P' _{t}*.

If *k* is the
largest possible value such that *P*[*j* + 1 . . *m*] ⊐ *P _{k}*, then we
claim that

π'[*l*] = *m* - *j* ,    (8)

where *l = *(*m -
k*) + (*m - j*). To see that this claim is well defined, note that *P*[*j* + 1 . . *m*] ⊐ *P _{k}* implies that
*m - j* ≤ *k*,
and thus *l* ≤ *m*.
Also, *j *<* m *and* k* ≤ *m*, so
that *l* ≥ 1. We prove
this claim as follows. Since *P*[*j* + 1 . . *m*] ⊐ *P _{k}*, we have *P' _{m-j}* ⊐ *P' _{l}*. Therefore,
π'[*l*] ≥ *m - j*.
Suppose now that *p > m - j*, where *p = *π*'*[*l*].
Then, by the definition of π', we have *P' _{p}* ⊐ *P' _{l}* or,
equivalently, *P*'*[1* . . p*]* = P*'*[*l - p + *1 *.
. l*]. Rewriting this equation in terms of *P* rather than *P'*,
we have *P*[*m - p + *1 *. . m*] = *P*[*m - l + *1 *.
. m - l + p*]. Substituting for *l = 2m - k - j*, we obtain *P*[*m
- p + *1* . . m*]* = P*[*k - m + j + *1 *. . k - m + j +p*],
which implies *P*[*m - p + *1 *. . m*] ⊐ *P _{k-m+j+p}*. Since *p
> m - j*, we have *j + *1* > m-p+*1, and so *P*[*j* + 1 . . *m*] ⊐ *P*[*m - p + *1 *. . m*], implying
that *P*[*j* + 1 . . *m*] ⊐ *P _{k-m+j+p}* by the
transitivity of ⊐. Finally,
since *p > m - j*, we have *k' > k*, where *k' = k - m + j +
p*, contradicting our choice of *k* as the largest possible value such
that *P*[*j* + 1 . . *m*] ⊐ *P _{k}*. This
contradiction means that we can't have *p > m - j*, and thus *p = m - j*,
which proves the claim (8).
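The claim can also be checked numerically. This sketch (helper names are mine) computes π' for the reversed pattern and verifies the claim (8) for every *j* that admits a qualifying *k*:

```python
def prefix_function(p):
    """pi[q]: length of the longest proper prefix of p that is also a
    suffix of p[:q], for q = 1..len(p); pi[0] is unused padding."""
    pi = [0] * (len(p) + 1)
    k = 0
    for q in range(2, len(p) + 1):
        while k > 0 and p[k] != p[q - 1]:
            k = pi[k]
        if p[k] == p[q - 1]:
            k += 1
        pi[q] = k
    return pi

def check_claim_8(p):
    """For every j < m for which some k with P[j+1..m] a suffix of P_k
    exists, verify that pi'[l] = m - j, where l = (m - k) + (m - j)
    and k is the largest such value."""
    m = len(p)
    pi_rev = prefix_function(p[::-1])   # pi' for the reversed pattern
    for j in range(m):
        ks = [k for k in range(m) if p[:k].endswith(p[j:])]
        if ks:
            k = max(ks)
            l = (m - k) + (m - j)
            assert pi_rev[l] == m - j
    return True
```

For example, for *P* = abab and *j* = 2, the largest qualifying *k* is 2, giving *l* = 4 and π'[4] = 2 = *m* - *j*.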

Using equation (8), and
noting that π'[*l*]*
= m - j* implies that *j = m - *π*'*[*l*]
and *k = m - l + *π*'*[*l*],
we can rewrite our definition of γ still
further:

γ[*j*] = *m* - max ({π[*m*]} ∪ {*m* - *l* + π'[*l*] : 1 ≤ *l* ≤ *m* and *j* = *m* - π'[*l*]})

= min ({*m* - π[*m*]} ∪ {*l* - π'[*l*] : 1 ≤ *l* ≤ *m* and *j* = *m* - π'[*l*]}) .    (9)

Again, the second set may be empty.

We are now ready to examine the procedure for computing γ.

COMPUTE-GOOD-SUFFIX-FUNCTION(*P*, *m*)

1 π ← COMPUTE-PREFIX-FUNCTION(*P*)

2 *P*' ← reverse(*P*)

3 π' ← COMPUTE-PREFIX-FUNCTION(*P*')

4 for *j* ← 0 to *m*

5   do γ[*j*] ← *m* - π[*m*]

6 for *l* ← 1 to *m*

7   do *j* ← *m* - π'[*l*]

8      if γ[*j*] > *l* - π'[*l*]

9        then γ[*j*] ← *l* - π'[*l*]

10 return γ

The procedure COMPUTE-GOOD-SUFFIX-FUNCTION is a straightforward
implementation of equation (9). Its running time is *O*(*m*).
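A Python sketch of this computation, with the prefix-function routine included so the fragment is self-contained (1-based π values are stored in a 0-padded list):

```python
def compute_prefix_function(p):
    """pi[q] = length of the longest proper prefix of p that is a
    suffix of p[:q], for q = 1..m (pi[0] is unused padding)."""
    m = len(p)
    pi = [0] * (m + 1)
    k = 0
    for q in range(2, m + 1):
        while k > 0 and p[k] != p[q - 1]:
            k = pi[k]
        if p[k] == p[q - 1]:
            k += 1
        pi[q] = k
    return pi

def compute_good_suffix(p):
    """Good-suffix function gamma[0..m], following equation (9):
    gamma[j] = min(m - pi[m], min{l - pi'[l] : pi'[l] = m - j})."""
    m = len(p)
    pi = compute_prefix_function(p)
    pi_rev = compute_prefix_function(p[::-1])   # pi' of the reverse
    gamma = [m - pi[m]] * (m + 1)               # first term of the min
    for l in range(1, m + 1):
        j = m - pi_rev[l]
        gamma[j] = min(gamma[j], l - pi_rev[l])
    return gamma
```

Each step of the second loop applies one member of the second set in equation (9) to the unique *j* it concerns, so the whole computation runs in *O*(*m*) time.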

The worst-case running
time of the Boyer-Moore algorithm is clearly *O*((*n - m* +1)*m*
+ |Σ|), since COMPUTE-LAST-OCCURRENCE-FUNCTION takes time *O*(*m*
+ |Σ|), COMPUTE-GOOD-SUFFIX-FUNCTION takes time *O*(*m*),
and the Boyer-Moore algorithm (like the Rabin-Karp algorithm) spends *O*(*m*)
time validating each valid shift *s*. In practice, however, the
Boyer-Moore algorithm is often the algorithm of choice.
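Putting the two heuristics together, a complete (if unoptimized) matcher might look like the following sketch. All names are mine rather than the book's, and the good-suffix function is computed by brute force from its definition to keep the block short:

```python
def boyer_moore_matcher(text, pattern):
    """Return all 0-based shifts s at which `pattern` occurs in `text`,
    scanning the pattern right to left and advancing the shift by the
    larger of the bad-character and good-suffix proposals."""
    n, m = len(text), len(pattern)
    # Last-occurrence function: rightmost 1-based position of each char.
    lam = {}
    for j, ch in enumerate(pattern, start=1):
        lam[ch] = j
    # Good-suffix function, by brute force from its definition.
    def similar(q, r):
        return q.endswith(r) or r.endswith(q)
    gamma = [m - max(k for k in range(m)
                     if similar(pattern[j:], pattern[:k]))
             for j in range(m + 1)]
    shifts = []
    s = 0
    while s <= n - m:
        j = m
        while j > 0 and pattern[j - 1] == text[s + j - 1]:
            j -= 1
        if j == 0:
            shifts.append(s)
            s += gamma[0]
        else:
            # Characters absent from the pattern have lam = 0.
            s += max(gamma[j], j - lam.get(text[s + j - 1], 0))
    return shifts
```

For example, `boyer_moore_matcher("ababbababa", "aba")` returns `[0, 5, 7]`. Replacing the brute-force γ with the *O*(*m*) procedure above yields the advertised preprocessing time.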

Compute the λ and γ functions
for the pattern *P* = and the alphabet Σ = .

Give examples to show that by combining the bad-character and good-suffix heuristics, the Boyer-Moore algorithm can perform much better than if it used just the good-suffix heuristic.

An improvement to the basic Boyer-Moore procedure that is often used in practice is to replace the γ function by γ', defined by

γ'[*j*] = *m* - max {*k* : 0 ≤ *k* < *m* and *P*[*j* + 1 . . *m*] ~ *P _{k}* and (*k* - *m* + *j* ≤ 0 or *P*[*k* - *m* + *j*] ≠ *P*[*j*])} .

In addition to ensuring that the characters in the good suffix will not be mismatched at the new shift, the γ' function also guarantees that the same pattern character will not be matched up against the bad text character. Show how to compute the γ' function efficiently.

34-1 String matching based on repetition factors

Let *y ^{i}* denote the concatenation of
string *y* with itself *i* times. For example, (ab)^{3} = ababab. We say that a string *x* ∈ Σ* has ** repetition factor** *r* if *x* = *y ^{r}* for some string *y* ∈ Σ* and some *r* > 0. Let ρ(*x*) denote the largest *r* such that *x* has repetition factor *r*.

*a** *Give an efficient algorithm that takes as
input a pattern *P*[1 . . *m*] and computes ρ(*P _{i}*)
for *i* = 1, 2, . . . , *m*. What is the running time of your algorithm?

** b**. For any pattern *P*[1 . . *m*], let ρ*(*P*) be defined as max {ρ(*P _{i}*) : 1 ≤ *i* ≤ *m*}. Prove that if the pattern *P* is chosen randomly from the set of all binary strings of length *m*, then the expected value of ρ*(*P*) is *O*(1).

**c**. Argue that the following string-matching algorithm correctly finds
all occurrences of pattern *P* in a text *T*[1 . . *n*] in time *O*(ρ*(*P*)*n*
+ *m*).

REPETITION-MATCHER(*P*, *T*)

1 *m* ← length[*P*]

2 *n* ← length[*T*]

3 *k* ← 1 + ρ*(*P*)

4 *q* ← 0

5 *s* ← 0

6 while *s* ≤ *n* - *m*

7   do if *T*[*s* + *q* + 1] = *P*[*q* + 1]

8        then *q* ← *q* + 1

9             if *q* = *m*

10              then print 'Pattern occurs with shift' *s*

11     if *q* = *m* or *T*[*s* + *q* + 1] ≠ *P*[*q* + 1]

12       then *s* ← *s* + max(1, ⌈*q*/*k*⌉)

13            *q* ← 0
This algorithm is due to
Galil and Seiferas. By extending these ideas greatly, they obtain a linear-time
string-matching algorithm that uses only *O*(1) storage beyond what is
required for *P* and *T*.
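As an illustration of the algorithm in part (c), here is a direct Python transcription (0-based, with ρ*(*P*) computed by brute force from the definition; the names are mine):

```python
def rho_star(p):
    """Largest repetition factor rho(P_i) over all prefixes P_i of P,
    computed by brute force from the definition x = y^r."""
    best = 1
    for i in range(1, len(p) + 1):
        prefix = p[:i]
        for r in range(2, i + 1):
            if i % r == 0 and prefix == prefix[:i // r] * r:
                best = max(best, r)
    return best

def repetition_matcher(pattern, text):
    """Find all shifts of `pattern` in `text`; on a mismatch after q
    matched characters, advance the shift by max(1, ceil(q / k)),
    where k = 1 + rho_star(pattern)."""
    m, n = len(pattern), len(text)
    k = 1 + rho_star(pattern)
    q = 0
    s = 0
    shifts = []
    while s <= n - m:
        if text[s + q] == pattern[q]:
            q += 1
            if q == m:
                shifts.append(s)
        if q == m or text[s + q] != pattern[q]:
            s += max(1, -(-q // k))   # ceil(q / k) via floor division
            q = 0
    return shifts
```

For example, `repetition_matcher("aab", "aaabaab")` returns `[1, 4]`. Note that when `q == m` the short-circuit in the second `if` avoids reading past the end of the text.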

34-2 Parallel string matching

Consider the problem of
string matching on a parallel computer. Assume that for a given pattern, we
have a string-matching automaton *M* with state set *Q*. Let φ be the
final-state function for *M*. Suppose that our input text is *T*[1 . .
*n*]. We wish to compute φ(*T _{i}*)
for *i* = 1, 2, . . . , *n*; that is, we wish to know the state of the automaton after it has read each prefix of the text.

For any input string *x*,
define the function σ* _{x}*
: *Q* → *Q* such that σ* _{x}*(*q*) is the state in which *M* ends up if it starts in state *q* and reads the string *x*.

** a**. Prove that σ* _{y}* ∘ σ* _{x}* = σ* _{xy}*, where ∘ denotes
functional composition: (σ* _{y}* ∘ σ* _{x}*)(*q*) = σ* _{y}*(σ* _{x}*(*q*)).

** b**. Argue that ∘ is an
associative operation.

** c**.
Argue that a representation of σ* _{y}* ∘ σ* _{x}* can be computed from representations of σ* _{x}* and σ* _{y}* in *O*(|*Q*|) time.

** d**. Prove that φ(*T _{i}*) = (σ* _{T[i]}* ∘ σ* _{T[i-1]}* ∘ ··· ∘ σ* _{T[1]}*)(*q _{0}*), where *q _{0}* is the start state of *M*.
** e**. Show how to find all
occurrences of a pattern in a text of length *n* in *O*(lg *n*) time on a parallel machine with *n* processors, using a parallel prefix computation over ∘.
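To make parts (a)-(c) concrete, a tiny sketch (my own construction, not from the text) represents σ* _{x}* as a table indexed by state; composing tables then corresponds to concatenating input strings:

```python
def make_sigma(delta, num_states, x):
    """Table for sigma_x: entry q is the state reached from state q on
    input string x, under transition function delta(q, ch)."""
    table = []
    for q in range(num_states):
        for ch in x:
            q = delta(q, ch)
        table.append(q)
    return tuple(table)

def compose(sig_x, sig_y):
    """Table for sigma_y ∘ sigma_x, i.e. sigma_{xy}: run x, then y.
    Each lookup is O(1), so composing tables costs O(|Q|) total."""
    return tuple(sig_y[s] for s in sig_x)

# String-matching automaton for the pattern "ab" over {a, b}:
# state = length of the longest prefix of "ab" matched so far.
def delta(q, ch):
    s = "ab"[:q] + ch
    while not "ab".startswith(s):   # drop characters until a prefix fits
        s = s[1:]
    return len(s)

x, y = "ab", "ba"
assert compose(make_sigma(delta, 3, x), make_sigma(delta, 3, y)) == \
       make_sigma(delta, 3, x + y)   # part (a): sigma_y ∘ sigma_x = sigma_xy
```

Because ∘ on these tables is associative, the tables for all text prefixes can be combined by a standard parallel prefix computation, which is the idea behind part (e).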

The relation of string matching to the theory of finite
automata is discussed by Aho, Hopcroft, and Ullman [4]. The Knuth-Morris-Pratt
algorithm [125] was invented independently by Knuth and Pratt and by Morris;
they published their work jointly. The Rabin-Karp algorithm was proposed by
Rabin and Karp [117], and the Boyer-Moore algorithm is due to Boyer and Moore
[32]. Galil and Seiferas [78] give an interesting deterministic linear-time
string-matching algorithm that uses only *O*(1) space beyond that required
to store the pattern and text.
