Строительный блокнот  Automata methods and madness 

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 [ 36 ] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175

The symbol , (dot) stands for any character. The sequence [ 102 fc] stands for the regular expression

1 + a2 +----\- ak

This notation saves about half the characters, since we dont have to write the H-sigTis, For example, we could express the four characters used in С comparison operators by [<>=! ].

Between the square braces we can put a range of the form x-y to mean all the ciiaracters from x to у in the ASCII sequence. Since the digits have codes in order, as do the upper-case letters and the lower-case letters, we can express many of the classes of characters that we really care about with just a few keystrokes. For example, the digits can be expressed [0-9], the upper-case letters can be expressed [A-2], and the set of all letters and digits can be expressed [A-Za-zO-9]. If we want to include a minus sign among a list of characters, we can place it first or last, so it is not confused with its use to form a character range. For example, the set

3.3 Applications of Regular Expressions

A regular expression that gives a picture of the pattern we want to recognize is the medium of choice for applications that search for patterns in text. The regular expressions are then compiled, behind the scenes, into deterministic or nondeterministic automata, which are then simulated to produce a program that recognizes patterns in text. In this section, we shall consider two important classes of regular-expression-based applications: lexical analyzers and text search.

3.3.1 Regular Expressions in UNIX

Before seeing the applications, we shall introduce the UNIX notation for extended regular expressions. This notation gives us a number of additional capabilities. In fact, the UNIX extensions include certain features, especially the ability to name and refer to previous strings that have matched a pattern, that actually tdlow nonrogular languages to be recognized. We shall not consider these features here; rather wc shall only introduce the shorthands that allow complex regular expressions to be written succinctly.

The first enhancement to the regular-expression notation concerns the fact that most real applications deal with the ASCII cliaiacter set. Our examples have typically used a small alphabet, such as {0,1}. The existence of only two symbols aUowed us to write succinct expressions like 0 -I-1 for any character. However, if there were 128 characters, say, the same expression would involve listing them all, and would be highly inconvenient to write. Thus, UNIX regular expressions allow us to write character classes to represent large sets of characters as succinctly as possible. The rules for character classes are:



The notation [idigit:] has the advantage that should some code other than ASCII be used, including a code where the digits did not liave consecutive codes, [:digit;] would still represent [01234Б6789], while [0-9] would represent whatever characters had codes between the codes for 0 and 9, inclusive.

of digits, plus the dot, plus, and minus signs that are used to form signed decimal numbers may be expressed [-+.0-9]. Square brackets, or other characters that have special meanings in UNIX regular expressions can be represented as characters by preceding them with a backslash {\).

There are special notations for several of the most common classes of characters. For instance;

a) [: digit:] is the set of ten digits, the same as [0-9]

b) [:alpha;] stands for any alphabetic character, as does [A-Za-z].

c) [: alnum: ] stands for the digits and letters (alphabetic tmd numeric characters), as does [A-Za-zO-9].

In addition, there are several operators that are used in UNIX regular expressions that we have not encountered previously. None of these operators extend what languages can be expressed, but they sometimes make it easier to express what we want.

1. The operator is used in place of + to denote union.

2. The operator ? means iro or one of. Thus, Л? in UNIX is the same as e + Я in this books regular-expression notation.

3. The operator + means one or more of. Thus, R+ in UNIX is shorthand for RR* in our notation.

4. The operator {n} means n copies of. Thus, R{b] in UNIX is shorthand for RRRRR.

Note that UNIX regular expressions allow parentheses to group subexpressions, just as for the regular expressions described in Section 3.1.2, and the same operator precedence is used (with ?, + and {n} treated like * as far as precedence is concerned). The star operator * is used in UNIX (without being a superscript, of course) with the same meaning as we have used.

3,3,2 Lexical Analysis

One of the oldest applications of regular expressions wa-s in specifying the component of a compiler called a lexical analyzer. This component scans the source program and recognizes all tokens, those substrings of consecutive characters that belong together logically. Keywords and identifiers are common examples of tokens, but there are many others.



The Complete Story for UNIX Regular Expressions

The reader who wants to get the complete list of operators and shorthands available in the UNIX regular-expression notation can find them in the manual pages for TOrious commands. There are some differences among the various versions of UNIX, but a command like man grep will get you the notation used for the grep command, which is fundamental, Grep stands for Global (search for) Regular Expression and Print, incidentally.

The UNIX command lex and its GNU version flex, accept as input a list of regular expressions, in the UNIX style, each followed by a bracketed section of code that indicates what the lexical analyzer is to do when it finds an instance of that token. Such a facility is called a lexical-analyzer generator, because it takes as input a high-level description of a lexical analyzer and produces from it a function that is a working lexical analyzer.

Commands such as lex and flex have been found extremely useful because the regular-expression notation is exactly as powerful as we need to describe tokens. These commands are able to use the regular-expression-to-DFA conversion process to generate an efficient function that breaks source programs into tokens. They make the implementation of a lexical analyzer an afternoons work, while before the development of these regular-expression-based tools, the hand-generation of the lexical analyzer could take months. Further, if we need to modify the lexical analyzer for any reason, it is often a simple matter to change a regular expression or two, instead of having to go into mysterious cod(; to fix a bug.

Example 3.9: In Fig. 3.19 is an example of partial input to the lex command, describing some of the tokens that are found in the language C. The first line handles the keyword else and the action is to return a symbolic constant (ELSE in this example) to the parser for further processing. The second line contains a regular expression describing identifiers; a letter followed by zero or more letters and/or digits. The action is first to enter that identifier in the symbol table if not already there; lex isolates the token found in a buffer, so this piece of code knows exactly what identifier was found. Finally, the lexical analyzer returns the symbolic constant ID, which has been chosen in this example to represent identifiers.

The third entry in Fig. 3.19 is for the sign >=, a two-character operator. The last example we show is for the sign =, a one-character operator. There would in practice appear expressions describing each of the keywords, each of the signs and punctuation symbols Hke commas and parentheses, and famihes of constants such as numbers and strings. Many of these are very simple, just a sequence of one or more specific characters. However, some have more



1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 [ 36 ] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175