Understanding Boyer-Moore
Boyer-Moore string searching algorithm is very elementary string searching algorithm that is probably gone through as a first or second algorithm in every string algorithm course. Even though it is a basic algorithm, it doesn’t mean that it wouldn’t be used in any real-world application. For example, Boyer-Moore is used in GNU Grep for fixed string matching 1 and also in Silver Surfer (well, both use a different BM variant but still BM).
BM starts comparing characters one by one starting from the last character in the pattern. By comparing each character we don’t get much better performance than a trivial algorithm, so BM tries to minimize the character comparisons and not to do any pointless comparisons. Character comparisons can be minimized by shifting the pattern to the right when a mismatch occurs. By starting the comparison from the last character, algorithm can learn more about matched and mismatched characters.
text: ABBABAZ AABBABAB ABACBCBBABAB
pattern: ABBABAB
↑
So here the text is ABBABAZ AABBABAB ABACBCBBABAB
and the pattern that we are
searching from the text is ABBABAB
. The character that the algorithm is
currently comparing is pointed by the ↑
. Character Z
doesn’t occur in the
pattern, and if the last character doesn’t match, there is no point in comparing
any of the preceding characters, so pattern can shifted fully past the Z
character.
text: ABBABAZ AABBABAB ABACBCBBABAB
pattern: ABBABAB
↑
The amount BM shifts the pattern forward is calculated by two rules: bad
character rule and good suffix rule. We saw the bad character rule already in
the previous example. It tells how much the pattern can be shifted given a
character that caused the mismatch. Bad character rule produces a table
consisting of all possible characters that can occur in the text. Each element
tells how much pattern can be shifted when this character caused the
mismatch. Characters that are not in the pattern, e.g. Z
in this case, have
the length of the pattern as the value. Characters that occur in the pattern
have the value of distance from the end of the pattern to the rightmost
occurrence. If the distance is 0
, algorithm would still shift it by 1
since
otherwise we would stay at the same place indefinitely. For example, A
would
have a value of 1
since it is the second character from the end. The full
table for this pattern would be:
Character: A B Other
Value: 1 0 7
Good suffix rule is a bit more complex. It focuses on the already matched characters (i.e. the suffix that was successfully matched). It consists of two cases. First, it tries to align matched suffix to another occurrence of it in the pattern (preceded by a different character) and if this is not possible, it tries to align a suffix of the matched part and a prefix of the pattern. Coming back to the previous example.
text: ABBABAZ AABBABAB ABACBCBBABAB
pattern: ABBABAB
↑
At this point, BAB
is the suffix that is successfully matched since a mismatch
happens when comparing B
to A
. Bad character shift would be now 1
(table
says 0
but we’ll really shift by 1
), but with good suffix rule we could
shift more. If there is another occurrence of the matched suffix somewhere else
in the pattern, that occurrence is aligned to the same substring in the text. But
since we don’t want to do a pointless shift and end up in the same situation
(i.e. successfully match BAB
but then fail when comparing B
to A
), we need
to check that this occurrence is not preceded by the same character that the
matched suffix was. In our case, BAB
is preceded by A
, so now we want to
find BAB
somewhere else in the pattern which preceded by something else. BAB
occurs only in ‘ABBABAB’ and ‘ABBABAB’ which are both preceded by
different characters. So the pattern is aligned as follows:
text: ABBABAZ AABBABAB ABACBCBBABAB
pattern: ABBABAB
↑
Now the pattern is matched and we end up in a situation like this:
text: ABBABAZ AABBABAB ABACBCBBABAB
pattern: ABBABAB
↑
Again, bad character shift would be 1
. Now the good suffix is ABBABAB
,
i.e. the whole pattern but it cannot occur anywhere else in the pattern. But
according to the second case in good suffix rule, we will try to align a suffix
of the matched part and a prefix of the pattern. AB
is both prefix and suffix
of ABBABAB
so we will align ‘ABBABAB’ and ‘ABBABAB’.
text: ABBABAZ AABBABAB ABACBCBBABAB
pattern: ABBABAB
↑
So, we have two rules in BM and neither of them will miss any of the occurrences. Sometimes bad character rule can shift more than good suffix rule and sometimes vice versa. Hence, it is more efficient to calculate the shift from both rules and use the maximum. It would be possible to only implement the other and still be able to match all possible occurrences.
Implementation of the main BM algorithm is quite straightforward. As said before, BM compares characters one by one starting from the end of the pattern and shifts it according to the two rules when mismatch occurs.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 |
|
This function runs BM algorithm on a given text and pattern, prints positions where matches occur and returns the number of the matches. The first few lines initialize the structures and calculate lengths of the pattern and the text (which could be omitted). Then at the for loop, we start looping through the characters, starting from the end of the pattern and continuing until we have gone through the whole text. The while loop compares the characters in the text and in the pattern until a mismatch occurs (or until we have successfully matched each character so we have a match). After that the position where we read from the text is shifted according to the two rules.
One thing to note though is that when calculating shift from bad character rule, we must take number of matched characters into account. For example, if the situation is:
text: ABBAZABBABAB
pattern: ABBABAB
↑
Value in the bad character shift table would be 7, which would be invalid since valid match would not be seen. Hence we have to subtract matched characters from 7, i.e. actual shift would be 5 and next place for the pattern is:
text: ABBAZABBABAB
pattern: ABBABAB
↑
In addition, when calculating the shift from bad character rule (delta1
), we
have to make sure that if bad_char[text[i]]
is zero, we shift it with 1.
As you can see, the main logic is very simple and it can be done in a few lines of code. Some error checks here are omitted for simplicity, for example, checking that pattern and text are nonempty and non-null and that the pattern is not shorter than the text, but they would be very trivial to add.
1 2 3 4 5 6 7 8 9 10 11 |
|
Let’s move on to the bad character rule. The first for loop initializes the
whole bad character array with the length of the pattern. This way we can
address the characters that are not in the pattern. To recap, if the character
in the text doesn’t occur anywhere in the pattern, we can move the whole pattern
past that. ALPHABET_SIZE
should cover all possible characters in text. Here we
are using CHAR_MAX
which should be enough when dealing with ASCII text.
The second for loop in bad character rule sets the characters that are actually in the pattern. In a case of a mismatch, when the character is in the pattern, we must shift the pattern so that the character in the text and the corresponding character in the pattern are aligned. If there are many occurrences of that character in the pattern, we should align with the rightmost.
1 2 3 4 5 6 7 8 9 10 11 12 |
|
Only thing left now is the good suffix rule which is also arguably the most
difficult part. Easiest (but not the most efficient) way to tackle this is to
simply turn the both cases to code. At first, we have few corner cases: if a
mismatch happens on the first character, we can only shift by one since we don’t
really know anything about the text yet. Secondly, since good_suffix_rule
is
not initialized, we should set zeroth element to zero since case 1 cannot apply
to it. Then we build case 1 and case 2 of the good suffix rule.
First case in good suffix rule is to find the matched part somewhere else in the pattern, but it cannot have the same preceding character as the matched part. Easiest way to do this is to search all possible suffixes from the pattern.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
|
So, start searching the occurrences of the suffixes. For example, if the pattern
is abbabab
, we’ll search occurrences of b
, then ab
, bab
, abab
and so
on. For b
, you’ll find occurrences in ‘abbabab’, but it
may not be preceded by a
(because a
is now the character that caused the
mismatch and shifting with that would bring us to the same situation) and the
rightmost b
doesn’t count (because shift would be 0 which would be
nonsense). Hence, the correct occurrence is ‘abbabab’. When building case 1
for good suffix rule, we feed all these suffixes to find_occurrence
function.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
|
Here we try to find the given suffix somewhere from the pattern. It basically
tries to be reverse strstr
, because we are interested in the rightmost
match. In addition, the match may not be preceded by the mismatched character
(parameter preceding
). For example, when calculating good suffix rule for
bab
, where b
is the mismatched character so the good suffix is ab
, this
function makes following comparisons:
text: abbabab
pattern: ab # 'ba' != 'ab'
ab # 'ab' == 'ab', but preceded by 'b'
ab # 'ba' != 'ab'
ab # 'bb' != 'ab'
ab # 'ab' == 'ab', and not preceded by 'b' -> return 0
Last part of the good suffix rule should check if a suffix of the matched
substring is a prefix of the pattern. In other words, check if some rightmost
part of the matched substring is included at the beginning of the pattern. This
case doesn’t apply if previous case applied, so if good_suffix_rule[i]
is
nonzero, we don’t have to calculate anything there. One simple way to implement
this, it is to take each suffix and check if ending of it occurs at the
beginning of the pattern.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
|
In build_good_suffix_case2
function, we take each suffix and give it to
is_prefix
function. This time the order of the suffixes doesn’t matter so
we’ll start looping from the left.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
|
Here we start checking if str
and prefix
are the same memory address,
because we are interested only in proper prefixes (proper prefix is not the
string itself). Since we know how this function is used, we may use memory
address comparison here. Then we start decreasing prefix
from the right until
it matches or until all suffixes are checked. For example, when str
is the
same ABBABAB
, and prefix
is BBABAB
, following comparisons are made:
str: abbabab
prefix: bbabab
babab
abab
bab
ab # match -> return '4'
This implementation of the good suffix rule is quite inefficient since we discard a lot of information, e.g. we would only need to calculate possible suffixes once and we could learn those already when calculating the first case. But since we are trying to understand how BM works, I think this approach is justified since it is easier to understand.
You can find the whole implementation here with few unit tests. It should print something like this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
|
-
http://git.savannah.gnu.org/cgit/grep.git/tree/README↩