Previous Previous chapter · Next Next chapter · Contents Table of Contents

Chapter 4 : PATTERN MATCHING

4.1 INTRODUCTION

Pattern matching examines a "subject" string for some combination of characters, called a "pattern." The matching process may be very simple, or extremely complex. For example:

  1. The subject contains several color names. The pattern is the string "BLUE". Does the subject string contain the word "BLUE"?
  2. The subject contains a nucleic acid (DNA) sequence. The pattern searches for a subsequence that is replicated in two other places in the string.
  3. The subject contains a paragraph of English text. The pattern describes the spacing rules to be applied after punctuation. Does the subject string conform to the punctuation rules?
  4. The subject string represents the current board position in a game of Tick-Tack-Toe. The pattern examines this string and determines the next move.
  5. The subject contains a program statement from a prototype computer language. The pattern contains the grammar of that language. Is the statement properly formed according to the grammar?

Most programming languages provide rudimentary facilities to examine a string for a specific character sequence. SNOBOL4 patterns are far more powerful, because they can specify complex (and convoluted) interrelationships. The colors of a painting, the words of a sentence, the notes of a musical score have limited significance in isolation. It is their "relationship" with one another which provides meaning to the whole. Likewise, SNOBOL4 patterns can specify "context;" they may be qualified by what precedes or follows them, or by their position in the subject.

4.1.1 Knowns and Unknowns

Patterns are composed of "known" and "unknown" components.

"Knowns" are specific character strings, such as the string "BLUE" in the first example above. We are looking for a yes/no answer to the question: "Does this known item appear in the subject string?"

"Unknowns" specify the "kind" of subject characters we are looking for; the specific characters are not identifiable in advance. We might want to match only characters from a restricted alphabet, or any substring of a certain length, or some arbitrary number of repetitions of a string. If the pattern matches, we can then "capture" the particular subject substring matched.

4.2 SPECIFYING PATTERN MATCHING

A pattern match requires a subject string and a pattern. The subject is the first statement element after the label field (if any). The pattern appears next, separated from the subject by white space (blank or tab). If SUBJECT is the subject string, and PATTERN is the pattern, it looks like this:

    label   SUBJECT PATTERN
The pattern match "succeeds" if the pattern is found in the subject string; otherwise it fails. This success or failure may be tested in the GOTO field:
    label   SUBJECT PATTERN            :S(label1) F(label2)
A real point of confusion is the distinction between pattern matching and concatenation. How do you tell the difference? Where does the subject end and the pattern begin? In this case, parentheses should be placed around the subject, since SNOBOL4 always uses the first "complete" statement element as the subject. In the statement
    X Y Z
X is the subject, and Y concatenated with Z is the pattern. Whereas
    (X Y) Z
indicates the subject is string X concatenated with string Y, and the pattern is Z.

4.3 SUBJECT STRING

The subject string may be a literal string, a variable, or an expression. If it is not a string, its string equivalent will be produced before pattern matching begins. For example, if the subject is the integer 48, integer to string conversion produces the character string "48". Remember, if the subject includes concatenated elements, they should be enclosed in parentheses.

4.4 PATTERN SUBSEQUENTS AND ALTERNATES

Arithmetic expressions are composed of elements and simpler subexpressions. Similarly, patterns are composed of simpler subpatterns which are joined together as "subsequents" and "alternates." If P1 and P2 are two subpatterns, the expression

    P1 P2
is also a pattern. The subject must contain whatever P1 matches, immediately followed by whatever P2 matches. P2 is the "subsequent" of P1. The white space (blank or tab) between P1 and P2 is the same binary concatenation operator previously used to join strings; its use with patterns is completely analogous. The preceding pattern matches pattern P1 "followed by pattern" P2.

The binary "alternation" operator is the vertical bar (|). As it is a binary operator, it must have white space on each side. The pattern

    P1 | P2
matches whatever P1 matches, "or" whatever P2 matches. SNOBOL4 tries the various alternatives from left to right.

Normally, concatenation is performed before alternation, so the pattern

    P1 | P2 P3
matches P1 alone, or P2 "followed by" P3. Parentheses can be used to alter the grouping of subpatterns. For example:
    (P1 | P2) P3
matches P1 "or" P2, followed by P3.

When a pattern successfully matches a portion of the subject, the matching subject characters are "bound" to it. The next pattern in the statement must match beginning with the very next subject character. If a subsequent fails to match, SNOBOL4 backtracks, unbinding patterns until another alternative can be tried. A pattern match fails when SNOBOL4 cannot find an alternative that matches.

The null string may appear in a pattern. It always matches, but does not bind any subject characters. We can think of it as matching the invisible space "between" two subject characters. One possible use is as the last of a series of alternatives. For example, the pattern

    ROOT ('S' | 'ES' | '')
matches the pattern in ROOT, with an optional suffix of 'S' or 'ES'. If ROOT matches, but is not followed by 'S' or 'ES', the null string matches and successfully completes the clause. Its presence gives the pattern match a successful escape.

The conditional functions of the previous chapter may appear in patterns. If they fail when evaluated, the current alternative fails. If they succeed, they match the null string, and so do not consume any subject characters. They behave like a gate, allowing the match to proceed beyond them only if they are true. This pattern will match 'FOX' if N is 1, or 'WOLF' if N is 2:

    EQ(N,1) 'FOX' | EQ(N,2) 'WOLF'
Parentheses may be used to factor a pattern. The strings 'COMPATIBLE', 'COMPREHENSIBLE', and 'COMPRESSIBLE' are matched by the pattern:
    'COMP' ('AT' | 'RE' ('HEN' | 'S') 'S') 'IBLE'

4.5 SIMPLE PATTERN MATCHES

Here are examples of pattern matches using a string literal or variable for the subject. The patterns consist entirely of known elements. Use the CODE.SNO program to experiment with them:

    ?       'BLUEBIRD' 'BIRD'
    Success
    ?       'BLUEBIRD' 'bird'
    Failure
    ?       B = 'THE BLUEBIRD'
    ?       B 'FISH'
    Failure
    ?       B 'FISH' | 'BIRD'
    Success
    ?       B ('GOLD' | 'BLUE') ('FISH' | 'BIRD')
    Success
The first statement shows that the matching substring ('BIRD') need not begin at the start of the subject string. This is called "unanchored" matching. The second statement fails because strings are case sensitive, unlike names and labels. The third statement creates a variable to be used as the subject. The fifth statement employs an alternate: we are matching for 'FISH' or 'BIRD'.

The last statement uses subsequents and alternates. We are looking for a substring in B that contains 'GOLD' or 'BLUE', followed by 'FISH' or 'BIRD'. It will match 'GOLDFISH', 'GOLDBIRD', 'BLUEFISH' or 'BLUEBIRD'. If the parentheses were omitted, concatenation of 'BLUE' and 'FISH' would be performed before alternation, and the pattern would match 'GOLD', 'BLUEFISH', or 'BIRD'.

4.6 THE PATTERN DATA TYPE

If we execute the statement

    ?       COLOR = 'BLUE'
the variable COLOR contains the string 'BLUE', and could appear in the pattern portion of a statement:
    ?       B COLOR
    Success
Even though it is used as a pattern, COLOR has the "string" data type. However, complicated patterns may be stored in a variable just like a string or numeric value. The statement
    ?       COLOR = 'GOLD' | 'BLUE'
will create a "structure" describing the pattern, and store it in the variable COLOR. COLOR now has the "pattern" data type. The preceding example can now be written as:
    ?       CRITTER = 'FISH' | 'BIRD'
    ?       BOTH = COLOR CRITTER
    ?       B BOTH
    Success

4.7 CAPTURING MATCH RESULTS

If the pattern match

    B BOTH
succeeds, we may want to know which of the many pattern alternatives were used in the match. The binary operator "conditional assignment" assigns the matching subject substring to a variable. The operator is called conditional, because assignment occurs ONLY if the entire pattern match is successful. Its graphic symbol is a period (.). It assigns the matching substring on its left to the variable on its right. Note that the direction of assignment is just the opposite of the statement assignment operator (=). Continuing with the previous example, we'll redefine COLOR and CRITTER to use conditional assignment:
    ?       COLOR = ('GOLD' | 'BLUE') . SHADE
    ?       CRITTER = ('FISH' | 'BIRD') . ANIMAL
    ?       BOTH = COLOR CRITTER
    ?       B BOTH
    Success
    ?       OUTPUT = SHADE
    BLUE
    ?       OUTPUT = ANIMAL
    BIRD
The substrings that match the subpatterns COLOR and CRITTER are assigned to SHADE and ANIMAL respectively. The statement
    BOTH = COLOR CRITTER
had to be reexecuted because its previous execution captured the old values of COLOR and CRITTER, without the conditional assignment operators. The redefinition of COLOR and CRITTER was not reflected in BOTH until the statement was reexecuted.

Conditional assignment may appear at any level of pattern nesting, and may include other conditional assignments within its embrace. The pattern

    (('B' | 'F' | 'N') . FIRST 'EA' ('R' | 'T') . LAST) . WORD
matches 'BEAR', 'FEAR', 'NEAR', 'BEAT', 'FEAT', or 'NEAT', assigning the first letter matched to FIRST, the last letter to LAST, and the entire result to WORD.

The variable OUTPUT may be used as the target of conditional assignment. Try:

    ?       'B2' ('A' | 'B') . OUTPUT (1 | 2 | 3) . OUTPUT
    B
    2
    Success

4.8 UNKNOWNS

All of the previous examples used patterns created from literal strings. We may also want to specify the "qualities" of a match component, rather than its specific characters. Using unknowns greatly increases the power of pattern matching. There are two types, primitive patterns and pattern functions.

4.8.1 Primitive Patterns

There are seven primitive patterns built into the SNOBOL4 system. The two used most frequently will be discussed here. Chapter 7, "Advanced Topics," introduces the remaining five.

4.8.2 Cursor Position

During a pattern match, the "cursor" is SNOBOL4's pointer into the subject string. It is integer valued, and points "between" two subject characters. The cursor is set to zero when a pattern match begins, corresponding to a position immediately to the left of the first subject character. As the pattern match proceeds, the cursor moves right and left across the subject to indicate where SNOBOL4 is attempting a match. The value of the cursor will be used by some of the pattern functions that follow.

The "cursor position" operator assigns the current cursor value to a variable. It is a unary operator whose graphic symbol is the "at sign" (@). It appears within a pattern, preceding the name of a variable. By using OUTPUT as the variable, we can display the cursor position on the screen. For instance:

    ?       'VALLEY' 'A' @OUTPUT ARB 'E' @OUTPUT
    2
    5
    Success
    ?       'DOUBT' @OUTPUT 'B'
    0
    1
    2
    3
    Success
    ?       'FIX' @OUTPUT 'B'
    0
    1
    2
    Failure
Cursor assignment is performed whenever the pattern match encounters the operator, including retries. It occurs even if the pattern ultimately fails. The element @OUTPUT behaves like the null string -- it doesn't consume subject characters or interfere with the match in any way.

4.8.3 Integer Pattern Functions

These functions return a pattern based on their integer argument. The pattern produced can be used directly in a pattern match statement, or stored in a variable for later retrieval.

4.8.4 Character Pattern Functions

These functions produce a pattern based on a string-valued argument. Once again, the pattern may be used directly or stored in a variable.

We need to introduce one more fundamental concept -- replacement -- before we can write some meaningful programs.

4.9 PATTERN MATCHING WITH REPLACEMENT

Pattern matching identifies a subject substring with a particular trait, specified by the pattern. We used conditional assignment to copy that substring to a variable. Replacement moves in the other direction, letting you alter the substring in the subject. The space occupied by the matching substring may be enlarged or contracted (or removed entirely), leaving adjacent subject characters undisturbed. If the pattern matched the entire subject, replacement behaves like a simple assignment statement.

Replacement appears in a form similar to assignment:

    SUBJECT PATTERN = REPLACEMENT
First, the pattern match is attempted on the subject. If it fails, execution of the statement ends immediately, and replacement does not occur. If the match succeeds, any conditional assignments within the pattern are performed. The replacement field is then evaluated, converted to a string, and inserted in the subject in place of the matching substring. If the replacement field is empty, the null string replaces the matched substring, effectively deleting it. Try a few examples with CODE.SNO:
    ?       T = 'MUCH ADO ABOUT NOTHING'
    ?       T 'ADO' = 'FUSS'
    Success
    ?       OUTPUT = T
    MUCH FUSS ABOUT NOTHING
    ?       T 'NOTHING' =
    Success
    ?       OUTPUT = T
    MUCH FUSS ABOUT
    ?       'MASH' 'M' = 'B'
    Execution error #8, Variable not present where required
    Failure
The first replacement searches for 'ADO' in the subject string, replacing it with 'FUSS'. The second replacement has a null string replacement value, and deletes the matching substring. The last example demonstrates that a variable must be the subject of replacement. Variables can be changed; string literals -- like 'MASH' -- cannot.

The following will replace the 'M' in 'MASH' with a 'B':

    ?       VERB = 'MASH'
    ?       VERB 'M' = 'B'
    ?       OUTPUT = VERB
    BASH
If the matched substring appears more than once in the subject, only the first occurrence is changed. The remaining substrings must be found with a program loop. For example, a statement to eliminate all occurrences of the letter 'A' from the subject looks like this:
    ALOOP   SUBJECT 'A' =                   :S(ALOOP)
Here ALOOP is the statement label, SUBJECT is some variable containing the subject string, 'A' is the pattern, and the replacement field is empty. If an 'A' is found, it is deleted by replacing it with the null string, and the statement succeeds. The success GOTO branches back to ALOOP, and another search for 'A' is performed. The loop continues until no 'A's remain in the subject, and the pattern match fails. Of course, the pattern and replacement can be as complex as desired.

Simple loops like this can be tried in CODE.SNO by appending a semicolon after the GOTO field. (Semicolon is used with GOTO in CODE.SNO only; you would not use it in normal programs.) Continuing with the previous example:

    ?       VOWEL = ANY('AEIOU')
    ?VL     T VOWEL = '*'                   :S(VL);
    ?       OUTPUT = T
    M*CH F*SS *B**T
Since conditional assignment is performed before replacement, its results are available for use in the replacement field of the same statement. Here's an example of removing the first item from a list, and placing it on the end:
    ?       RB = 'RED,ORANGE,YELLOW,GREEN,BLUE,INDIGO,VIOLET,'
    ?       CYCLE = BREAK(',') . ITEM LEN(1) REM . REST
    ?       RB CYCLE = REST ITEM ','
    Success
    ?       OUTPUT = ITEM
    RED
    ?       OUTPUT = RB
    ORANGE,YELLOW,GREEN,BLUE,INDIGO,VIOLET,RED,
Pattern CYCLE matches the entire subject, placing the first color into ITEM, bypassing the comma with LEN(1), and placing the remainder of the subject into REST. REST and ITEM are then transposed in the replacement field, and stored back into RB.

4.10 SAMPLE PROGRAMS

I've introduced a lot of concepts in this chapter; it's time to see how they fit together into programs. They're supplied on the Vanilla SNOBOL4 diskette.

4.10.1 Word Counting

The first program counts the number of words in the input file. Lines with an asterisk in the first column are comment lines -- their contents are ignored by SNOBOL4.

    *   Simple word counting program, WORDS.SNO.
    *
    *   A word is defined to be a contiguous run of letters,
    *   digits, apostrophe and hyphen.  This definition of
    *   legal letters in a word can be altered for specialized
    *   text.
    *
    *   If the file to be counted is TEXT.IN, run this program
    *   by typing:
    *       B>SNOBOL4 WORDS /I=TEXT
    *
            &TRIM  =  1
            WORD   =  "'-"  '0123456789' &UCASE &LCASE
            WPAT   =  BREAK(WORD) SPAN(WORD)

    NEXTL   LINE   =  INPUT                      :F(DONE)
    NEXTW   LINE WPAT =                          :F(NEXTL)
            N      =  N + 1                      :(NEXTW)

    DONE    OUTPUT =  +N ' words'
    END
After defining the acceptable characters in a word, the real work of the program is performed in the three lines beginning with label NEXTL. A line is read from the input file, and stored in variable LINE. The next statement attempts to find the next word with pattern WPAT. BREAK streams by any blanks and punctuation, stopping just short of the word, which SPAN then matches. Both the word and any preceding punctuation are removed from LINE by replacement with the null string.

When no more words remain in LINE, the failure transfer to NEXTL reads the next line. If the match succeeds, N is incremented, and the program goes back to NEXTW to search for another word. When the End-of-File is encountered, control transfers to DONE and the number of words is displayed.

It's simple to alter pattern WPAT to search for other things. For instance, if we wanted to count occurrences of double vowels, we could use:

    WPAT = ANY('AEIOUaeiou') ANY('AEIOUaeiou')
To count the occurrences of integers with an optional sign character, use:
    WPAT = (ANY('+-') | '') SPAN('0123456789')
Perhaps we want to count violations of simple punctuation rules: period with only one blank, or comma and semicolon followed by more than one blank:
    WPAT = '. ' NOTANY(' ') | ANY(',;') ' ' SPAN(' ')
Notice how closely WPAT parallels the English language description of the problem.

4.10.2 Word Crossing

This program asks for two words, and displays all intersecting letters between them. For example, given the words LOOM and HOME, the program output is:

     H
    LOOM
     M
     E

      H
    LOOM
      M
      E

       H
       O
    LOOM
       E
A pattern match like this would find the first intersecting character:
    HORIZONTAL ANY(VERTICAL) . CHAR
However, we want to find all intersections, so will have to iterate our search. In conventional programming languages, we might use numerical indices to remember which combinations were tried. Here, we'll use place-holding characters like '*' and '#' to remove solutions from future consideration. As seems to be the case with SNOBOL4, there are more comments than program statements:
    * CROSS.SNO - Print all intersections between two words

            &TRIM = 1

    *  Get words from user
    *
    AGAIN   OUTPUT = 'ENTER HORIZONTAL WORD:'
            H      = INPUT                       :F(END)

            OUTPUT = 'ENTER VERTICAL WORD:'
            V      = INPUT                       :F(END)

    *       Make copy of horizontal word to track position.
            HC     = H

    *  Find next intersection in horizontal word.  Save
    *  the number of preceding horizontal characters in NH.
    *  Save the intersecting character in CROSS.
    *  Replace with '*' to remove from further consideration.
    *  Go to AGAIN to get new words if horizontal exhausted.
    *
    NEXTH   HC @NH ANY(V) . CROSS = '*'          :F(AGAIN)

    *  For each horizontal hit, iterate over possible
    *  vertical ones.  Make copy of vertical word to track
    *  vertical position.
    *
            VC     = V

    *  Find where the intersection was in the vertical word.
    *  Save the number of preceding vertical characters in NV.
    *  Replace with '#' to prevent finding it again in that
    *  position.  When vertical exhausted, try horizontal again.
    *
    NEXTV   VC @NV CROSS = '#'                   :F(NEXTH)

    *  Now display this particular intersection.
    *  We make a copy of the original vertical word,
    *  and mark the intersecting position with '#'.
    *
            OUTPUT =
            PRINTV = V
            PRINTV POS(NV) LEN(1) = '#'

    *  Peel off the vertical characters one-by-one.  Each will
    *  be displayed with NH leading blanks to get it in the
    *  correct column.  When the '#' is found, display the full
    *  horizontal word instead.
    *  When done, go to NEXTV to try another vertical position.
    *
    PRINT   PRINTV LEN(1) . C =                   :F(NEXTV)
            OUTPUT = DIFFER(C,'#') DUPL(' ',NH) C :S(PRINT)
            OUTPUT = H                            :(PRINT)
    END

4.11 ANCHORED AND UNANCHORED MATCHING

Most of the examples above match substrings which do not begin at the first subject character. This is the "unanchored" mode of pattern matching. Alternately, we can "anchor" the pattern match by requiring it to include the first subject character. Setting keyword &ANCHOR to a nonzero value produces anchored matching. Anchored matching is usually faster than unanchored, because many futile attempts to match are eliminated.

Even when the desired item is not at the beginning of the subject, it is often possible to simulate anchored matching by prefixing the pattern with a subpattern which will stream out to the desired object. The stream function spans the gap from the first subject character to the desired item. Use CODE.SNO to experiment with &ANCHOR:

    ?       DIGITS = '0123456789'
    ?       &ANCHOR = 1
    ?       'THE NEXT 43 DAYS' BREAK(DIGITS) SPAN(DIGITS) . N
This will assign substring '43' to N, even in anchored mode. In unanchored mode, the test lines:
    ?       &ANCHOR = 0
    ?       'THE NEXT 43 DAYS' SPAN(DIGITS) . N
would ultimately succeed, but only after SPAN failed on each of the characters preceding the '4'. The efficiency difference is more pronounced if the subject does not contain any digits. In the first formulation, BREAK(DIGITS) fails and the anchored match then fails immediately. The second construction fails only after SPAN is tried at each subject character position.

When your program first begins execution, SNOBOL4 sets keyword &ANCHOR to zero, the unanchored mode. If you can construct all your patterns as anchored patterns, you should set &ANCHOR nonzero for anchored matching. Setting and resetting &ANCHOR throughout your program is error prone and not advised. Another alternative is to leave &ANCHOR set to 0, but to 'pseudo-anchor' patterns by using POS(0) as the first pattern element.

It always takes less time for a pattern to succeed than to fail. Failure implies an exhaustive search of all combinations, whereas success stops the pattern match early. You should try to construct patterns with direct routes to success, such as the use of BREAK above. Wherever possible, impose restrictions on the number of alternatives to be tried. Combinatorial explosion is the price of loose pattern programming.


Previous Previous chapter · Next Next chapter · Contents Table of Contents