Строительный блокнот  Automata methods and madness 

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 [ 37 ] 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175

else {return(ELSE);}

CA-Za-z][A-Za-z0-9]* -{code to enter the found identifier

in the symbol table; returnСID);

>

>= {return(GE);}

{return(EQ);}

Figure 3,19: A sample of lex input

of the flavor of identifiers, requiring the full power of the regular-expression notation to describe. The integers, floating-point numbers, character strings, and comments are other examples of sets of strings that profit from the regular-expression capabilities of commands like lex. □

The conversion of a collection of expressions, such as those suggested in Fig. 3.19, to an automaton proceeds approximately as we have described formally in the preceding sections. We start by building an automaton for the union of all the expressions. This automaton in principle tells us only that some token has been recognized. However, if we follow the construction of Theorem 3,7 for the union of expressions, the e-NFA state tells us exactly which token has been recognized.

The only problem is that more than one token may be recognized at once; for instance, the string else matches not only the regular expression else but also the expression for identifiers. The standard resolution is for the lexical-analyzer generator to give priority to the first expression listed. Thus, if we want keywords like else to be reserved (not usable as identifiers), we simply list them ahead of the expression for identifiers.

3.3.3 Finding Patterns in Text

In Section 2.4.1 we introduced the notion that automata could be u.sed to search eiBciently for a set of words in a large repository such as the Web. While the tools and technology for doing so are not so well developed as that for lexical analyzers, the regular-expression notation is valuable for describing searches for interesting patterns. As for lexical analyzers, the capability to go from the natural, descriptive regular-expression notation to an efficient (aut(nnaton-based) implementation offers substantial intellectual leverage.



The general problem for which regular-expression technology has been found useful is the description of a vaguely defined class of patterns in text. The vagueness of the description virtually guarantees that we shall not describe the pattern correctly at first - perhaps we can never get exactly the right description. By using regular-expression notation, it becomes easy to describe the patterns at a high level, with little effort, and to modify the description quickly when things go wrong. A compiler for regular expressions is useful to turn the expressions we write into executable code.

Let us explore an extended example of the sort of problem that arises in many Web applications. Suppose that we want to scan a very large number of Web pages and detect addresses. We might simply want to create a mailing list. Or, perhaps we are trying to classify businesses by their location so that we can answer queries like find me a restaurant w-itlun 10 minutes drive of where I am now,

We shall focus on recognizing street addresses in particular. What is a street address? Well have to figure that out, and if, while testing the software, we find we miss some cases, well have to modify the expressions to capture what we were missing. To begin, a street address will probably end in Street or its abbreviation, St. Howe\er, some people live on Avenues or Roads, and these might be abbreviated in the address as well. Thus, we might use as the ending for our regidar expression something like:

Street ISt\.I AvenueAve\.RoadRd\.

In the above expression, we have used UNIX-style notation, with the vertical bar, rather than +, as the union operator. Note also that the dots are escaped with a preceding backslash, since dot has the special meaning of any character in UNIX expressions, and in this case we really want only the period or dot character to end the three abbreviations.

The designation such as Street must be preceded by the name of the street. Usually, the name is a capital letter followed by some lower-case letters. We can describe this pattern by the UNIX expression [A-Z] [a-z] *. However, some streets have a name consisting of more than one word, such as Rhode Island Avenue in Washington DC. Thus, after discovering that we were missing addresses of this form, we could revise our description of street names to be

4A-Z][a-z]*C [A-Z] [a-z]*)*

The expression above starts with a group consisting of a capital and zero or more lower-case letters. There follow zero or more groups consisting of a blank, another capital letter, and zero or more lower-case letters. The blank is an ordinary character in UNIX expressions, but to avoid having the above expression look like two expressions separated by a blank in a UNIX command line, we are required to place quotation marks around the whole expression. The quotes are not part of the expression itself.



Now, we need to include the house number as part of the address. Most house numbers are a string of digits. However, some will have a letter following, as in 123A Main St. Thus, the expression we use for numbers has an optional capital letter following; [0-9] +[A-Z]?, Notice that we use the UNIX -I- operator for one or more digits and the ? operator for zero or one capital letter. The entire expression we have developed for street addresses is:

[0-93 + [A-Z]? [A-Z][a-z]*( [A-Z] [a-z]*)* (Street t St\.I Avenue(Ave\.I RoadRd\.)

If we work with this expression, we shall do fairly well. However, we shall eventually discover that we are missing:

1. Streets that are called something other than a street, avenue, or road. For example, we shall miss Boulevard, Place, Way, and their abbreviations.

2. Street names that are numbers, or partially numbers, like 42nd Street.

3. Post-Office boxes and rural-delivery routes.

4. Street names that dont end in anythuig like Street. An example is El Camino Real in Silicon Valley, Being Spanish for the royal road, saying El Camino Real Road w-ould be redundant, so one has to deal with complete addresses Uke 2000 El Camino Real,

5. All sorts of strange things wc cant even imagine. Can you?

Thus, having a regular-expression compiler can make the process of slow convergence to the complete recognizer for addresses mucii easier than if we had to recode every change directly in a conventional programming language.

3.3.4 Exercises for Section 3-3

! Exercise 3.3.1: Give a regular expression to describe jhone numbers in all the various forms you can think of. Consider international numbers as well as the fact that different countries have different numbers of digits in area codes and in local phone numbers,

!! Exercise 3.3.2: Give a regular expression to represent salaries as they might appear in employment advertising. Consider that salaries might be given on a per hour, week, month, or year basis. They may or may not appear with a dollar sign, or other unit such as K following. There may be a word or words nearby that identify a salary. Suggestion: look at classified ads in a newspaper, or on-line jobs listings to get an idea of what patterns might be useful,

! Exercise 3.3.3: At the end of Section 3.3.3 we gave some examples of improvements that could be possible for the regular expression that describes addresses. Modifj the expression developed there to include all the mentioned options.



1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 [ 37 ] 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175