Understanding Boyer-Moore

Boyer-Moore string searching algorithm is very elementary string searching algorithm that is probably gone through as a first or second algorithm in every string algorithm course. Even though it is a basic algorithm, it doesn’t mean that it wouldn’t be used in any real-world application. For example, Boyer-Moore is used in GNU Grep for fixed string matching ¹ and also in Silver Surfer (well, both use a different BM variant but still BM).

BM starts comparing characters one by one starting from the last character in the pattern. By comparing each character we don’t get much better performance than a trivial algorithm, so BM tries to minimize the character comparisons and not to do any pointless comparisons. Character comparisons can be minimized by shifting the pattern to the right when a mismatch occurs. By starting the comparison from the last character, algorithm can learn more about matched and mismatched characters.

text:    ABBABAZ AABBABAB ABACBCBBABAB
pattern: ABBABAB
               ↑

So here the text is ABBABAZ AABBABAB ABACBCBBABAB and the pattern that we are searching from the text is ABBABAB. The character that the algorithm is currently comparing is pointed by the ↑. Character Z doesn’t occur in the pattern, and if the last character doesn’t match, there is no point in comparing any of the preceding characters, so pattern can shifted fully past the Z character.

text:    ABBABAZ AABBABAB ABACBCBBABAB
pattern:        ABBABAB
                      ↑

The amount BM shifts the pattern forward is calculated by two rules: bad character rule and good suffix rule. We saw the bad character rule already in the previous example. It tells how much the pattern can be shifted given a character that caused the mismatch. Bad character rule produces a table consisting of all possible characters that can occur in the text. Each element tells how much pattern can be shifted when this character caused the mismatch. Characters that are not in the pattern, e.g. Z in this case, have the length of the pattern as the value. Characters that occur in the pattern have the value of distance from the end of the pattern to the rightmost occurrence. If the distance is 0, algorithm would still shift it by 1 since otherwise we would stay at the same place indefinitely. For example, A would have a value of 1 since it is the second character from the end. The full table for this pattern would be:

Character:    A B Other
Value:        1 0 7

Good suffix rule is a bit more complex. It focuses on the already matched characters (i.e. the suffix that was successfully matched). It consists of two cases. First, it tries to align matched suffix to another occurrence of it in the pattern (preceded by a different character) and if this is not possible, it tries to align a suffix of the matched part and a prefix of the pattern. Coming back to the previous example.

text:    ABBABAZ AABBABAB ABACBCBBABAB
pattern:        ABBABAB
                   ↑

At this point, BAB is the suffix that is successfully matched since a mismatch happens when comparing B to A. Bad character shift would be now 1 (table says 0 but we’ll really shift by 1), but with good suffix rule we could shift more. If there is another occurrence of the matched suffix somewhere else in the pattern, that occurrence is aligned to the same substring in the text. But since we don’t want to do a pointless shift and end up in the same situation (i.e. successfully match BAB but then fail when comparing B to A), we need to check that this occurrence is not preceded by the same character that the matched suffix was. In our case, BAB is preceded by A, so now we want to find BAB somewhere else in the pattern which preceded by something else. BAB occurs only in ‘ABBABAB’ and ‘ABBABAB’ which are both preceded by different characters. So the pattern is aligned as follows:

text:    ABBABAZ AABBABAB ABACBCBBABAB
pattern:          ABBABAB
                        ↑

Now the pattern is matched and we end up in a situation like this:

text:    ABBABAZ AABBABAB ABACBCBBABAB
pattern:          ABBABAB
                 ↑

Again, bad character shift would be 1. Now the good suffix is ABBABAB, i.e. the whole pattern but it cannot occur anywhere else in the pattern. But according to the second case in good suffix rule, we will try to align a suffix of the matched part and a prefix of the pattern. AB is both prefix and suffix of ABBABAB so we will align ‘ABBABAB’ and ‘ABBABAB’.

text:    ABBABAZ AABBABAB ABACBCBBABAB
pattern:               ABBABAB
                             ↑

So, we have two rules in BM and neither of them will miss any of the occurrences. Sometimes bad character rule can shift more than good suffix rule and sometimes vice versa. Hence, it is more efficient to calculate the shift from both rules and use the maximum. It would be possible to only implement the other and still be able to match all possible occurrences.

Implementation of the main BM algorithm is quite straightforward. As said before, BM compares characters one by one starting from the end of the pattern and shifts it according to the two rules when mismatch occurs.

uint32_t boyer_moore(const char *text, const char *pattern)
{
    uint32_t matches = 0;

    const size_t pattern_len = strlen(pattern);
    const size_t text_len = strlen(text);

    uint32_t bad_char[ALPHABET_SIZE];
    uint32_t good_suffix[pattern_len + 1];

    build_bad_char_array(bad_char, pattern, pattern_len);
    build_good_suffix_array(good_suffix, pattern, pattern_len);

    for (uint32_t i = pattern_len - 1; i < text_len;)
    {
        uint32_t j = pattern_len - 1;
        while (text[i] == pattern[j]) {
            if (j == 0) {
                printf("match %u at %d\n", matches, i);
                matches++;
                break;
            }
            i--;
            j--;
        }

        const uint32_t matched_chars = pattern_len - 1 - j;
        const uint32_t delta1 = bad_char[text[i]] <= matched_chars ?
            1 : bad_char[text[i]] - matched_chars;
        const uint32_t delta2 = good_suffix[j + 1];

        const uint32_t shift = MAX(delta1, delta2);

        i += matched_chars + shift;
    }

    return matches;
}

This function runs BM algorithm on a given text and pattern, prints positions where matches occur and returns the number of the matches. The first few lines initialize the structures and calculate lengths of the pattern and the text (which could be omitted). Then at the for loop, we start looping through the characters, starting from the end of the pattern and continuing until we have gone through the whole text. The while loop compares the characters in the text and in the pattern until a mismatch occurs (or until we have successfully matched each character so we have a match). After that the position where we read from the text is shifted according to the two rules.

One thing to note though is that when calculating shift from bad character rule, we must take number of matched characters into account. For example, if the situation is:

text:    ABBAZABBABAB
pattern: ABBABAB
             ↑

Value in the bad character shift table would be 7, which would be invalid since valid match would not be seen. Hence we have to subtract matched characters from 7, i.e. actual shift would be 5 and next place for the pattern is:

text:    ABBAZABBABAB
pattern:      ABBABAB
                    ↑

In addition, when calculating the shift from bad character rule (delta1), we have to make sure that if bad_char[text[i]] is zero, we shift it with 1.

As you can see, the main logic is very simple and it can be done in a few lines of code. Some error checks here are omitted for simplicity, for example, checking that pattern and text are nonempty and non-null and that the pattern is not shorter than the text, but they would be very trivial to add.

void build_bad_char_array(uint32_t *bad_char_array, const char *pattern,
                          const size_t pattern_len)
{
    for (uint32_t i = 0; i < ALPHABET_SIZE; i++) {
        bad_char_array[i] = pattern_len;
    }

    for (uint32_t i = 0; i < pattern_len; i++) {
        bad_char_array[pattern[i]] = pattern_len - i - 1;
    }
}

Let’s move on to the bad character rule. The first for loop initializes the whole bad character array with the length of the pattern. This way we can address the characters that are not in the pattern. To recap, if the character in the text doesn’t occur anywhere in the pattern, we can move the whole pattern past that. ALPHABET_SIZE should cover all possible characters in text. Here we are using CHAR_MAX which should be enough when dealing with ASCII text.

The second for loop in bad character rule sets the characters that are actually in the pattern. In a case of a mismatch, when the character is in the pattern, we must shift the pattern so that the character in the text and the corresponding character in the pattern are aligned. If there are many occurrences of that character in the pattern, we should align with the rightmost.

void build_good_suffix_array(uint32_t *good_suffix_array, const char *pattern,
                             const size_t pattern_len)
{
    /* If the mismatch happens at the first character, we can only move by
     * one. */
    good_suffix_array[pattern_len] = 1;
    /* Case 1 doesn't apply on the first element, so initialize it to zero. */
    good_suffix_array[0] = 0;

    build_good_suffix_case1(good_suffix_array, pattern, pattern_len);
    build_good_suffix_case2(good_suffix_array, pattern, pattern_len);
}

Only thing left now is the good suffix rule which is also arguably the most difficult part. Easiest (but not the most efficient) way to tackle this is to simply turn the both cases to code. At first, we have few corner cases: if a mismatch happens on the first character, we can only shift by one since we don’t really know anything about the text yet. Secondly, since good_suffix_rule is not initialized, we should set zeroth element to zero since case 1 cannot apply to it. Then we build case 1 and case 2 of the good suffix rule.

First case in good suffix rule is to find the matched part somewhere else in the pattern, but it cannot have the same preceding character as the matched part. Easiest way to do this is to search all possible suffixes from the pattern.

static void build_good_suffix_case1(uint32_t *good_suffix_array, const char *pattern,
                                    const size_t pattern_len)
{
    /* Case 1: matched string occurs in the pattern. */
    for (uint32_t i = 1; i < pattern_len; i++) {
        uint32_t matched_start_idx = pattern_len - i;
        int m = find_occurrence(pattern, matched_start_idx,
                                pattern + matched_start_idx, i,
                                pattern[matched_start_idx - 1]);
        if (m >= 0) {
            good_suffix_array[pattern_len - i] = matched_start_idx - m;
        } else {
            good_suffix_array[pattern_len - i] = 0;
        }
    }
}

So, start searching the occurrences of the suffixes. For example, if the pattern is abbabab, we’ll search occurrences of b, then ab, bab, abab and so on. For b, you’ll find occurrences in ‘abbabab’, but it may not be preceded by a (because a is now the character that caused the mismatch and shifting with that would bring us to the same situation) and the rightmost b doesn’t count (because shift would be 0 which would be nonsense). Hence, the correct occurrence is ‘abbabab’. When building case 1 for good suffix rule, we feed all these suffixes to find_occurrence function.

static int find_occurrence(const char *haystack, const size_t comparison_start,
                           const char *needle, const size_t needle_len,
                           const char preceding)
{
    for (int i = comparison_start - 1; i >= 0; i--) {
        int match = strncmp(haystack + i, needle, needle_len);
        if (match == 0) {
            if (i == 0 || haystack[i - 1] != preceding) {
                return i;
            }
        }
    }
    return -1;
}

Here we try to find the given suffix somewhere from the pattern. It basically tries to be reverse strstr, because we are interested in the rightmost match. In addition, the match may not be preceded by the mismatched character (parameter preceding). For example, when calculating good suffix rule for bab, where b is the mismatched character so the good suffix is ab, this function makes following comparisons:

text:    abbabab
pattern:     ab    # 'ba' != 'ab'
            ab     # 'ab' == 'ab', but preceded by 'b'
           ab      # 'ba' != 'ab'
          ab       # 'bb' != 'ab'
         ab        # 'ab' == 'ab', and not preceded by 'b' -> return 0

Last part of the good suffix rule should check if a suffix of the matched substring is a prefix of the pattern. In other words, check if some rightmost part of the matched substring is included at the beginning of the pattern. This case doesn’t apply if previous case applied, so if good_suffix_rule[i] is nonzero, we don’t have to calculate anything there. One simple way to implement this, it is to take each suffix and check if ending of it occurs at the beginning of the pattern.

static void build_good_suffix_case2(uint32_t *good_suffix_array, const char *pattern,
                                    const size_t pattern_len)
{
    /* Case 2: suffix of a matched string is a prefix of the pattern. */
    for (uint32_t i = 0; i < pattern_len; i++) {
        if (good_suffix_array[i] != 0) {
            /* If case 2 is applies only if case 1 didn't. */
            continue;
        }

        int m = is_prefix(pattern, pattern + i, pattern_len - i);
        if (m > 0) {
            good_suffix_array[i] = i + m;
        } else {
            good_suffix_array[i] = pattern_len;
        }
    }
}

In build_good_suffix_case2 function, we take each suffix and give it to is_prefix function. This time the order of the suffixes doesn’t matter so we’ll start looping from the left.

static int is_prefix(const char *str, const char *prefix,
                     const uint32_t prefix_len)
{
    uint32_t i = 0;

    /* Prefix must be a proper prefix, so if str and prefix are the same then
     * start comparing from the first character. */
    if (str == prefix) {
        i = 1;
    }

    for (; i < prefix_len; i++) {
        int match = strncmp(str, prefix + i, prefix_len - i);
        if (match == 0) {
            return i;
        }
    }

    return -1;
}

Here we start checking if str and prefix are the same memory address, because we are interested only in proper prefixes (proper prefix is not the string itself). Since we know how this function is used, we may use memory address comparison here. Then we start decreasing prefix from the right until it matches or until all suffixes are checked. For example, when str is the same ABBABAB, and prefix is BBABAB, following comparisons are made:

str:     abbabab
prefix:  bbabab
         babab
         abab
         bab
         ab        # match -> return '4'

This implementation of the good suffix rule is quite inefficient since we discard a lot of information, e.g. we would only need to calculate possible suffixes once and we could learn those already when calculating the first case. But since we are trying to understand how BM works, I think this approach is justified since it is easier to understand.

You can find the whole implementation here with few unit tests. It should print something like this:

$ gcc -std=c99 -Wall -Wextra -pedantic -Wno-char-subscripts bm.c && ./a.exe
searching 'bar' from 'foo'
searching 'foo' from 'fo'
searching 'fooooo' from 'foooob'
searching 'foo' from 'foo'
match 0 at 0
searching 'foo' from 'barfoo'
match 0 at 3
searching 'foobar' from 'barfoo rfoobarr foobar'
match 0 at 8
match 1 at 16
searching 'foobar' from 'foobafoobafoobafoobafrboofarfoobar'
match 0 at 28
searching 'WOOHOO' from 'HOOWOOWOOHOOWOOWOO'
match 0 at 6
searching 'LALLILLA' from 'SALLILAILLATAVANLALLILLALLALALLI'
match 0 at 16

http://git.savannah.gnu.org/cgit/grep.git/tree/README↩

Aatos Jalo

Understanding Boyer-Moore

Comments