Строительный блокнот  Automata methods and madness 

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 [ 28 ] 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175

1. Search commands such as the UNIX grep or equivalent commands for finding strings that one sees in Web browsers or text-formatting systems. These systems use a regular-expresaion-like notation for describing patterns that the user wants to find in a file. Different search systems convert the regular expression into either a DFA or an NFA, and simulate that automaton on the file being searched.

2, Lexical-analyzer generators, such as Lex or Flex. Recall that a lexical analyzer is the component of a compiler that breaks the source program into logical units (called tokens) of one or more characters that have a shared significance. Examples of tokens include keywords (e.g., while), identifiers (e.g., any letter followed by zero or more letters and/or digits), and signs, such as + or <=. .A lexical-analyzer generator accepts descriptions of the forms of tokens, which are essentially regular expressions, and produces a DFA that recognizes which token appears next on the input.

3.1.1 The Operators of Regular Expressions

Regular expressions denote languages. For a simple example, the regular expression 01*4-10* denotes the language consisting of all strings that are either a single 0 followed by any number of Ts or a single 1 followed by any number of Os. We do not expect you to know at this point how to interpret regular expressions, so our statement about the language of this expression must be accepted on faith for the moment. We shortly shall define all the symbols used in this expression, so you can see why our interpretation of this regular expression is the correct one. Before describing the regular-expression notation, we need to learn the three operations on languages that the operators of regular expressions represent. These operations are:

1, The union of two languages L and M, denoted L U is the set of strings that are in either Lor M, or both. For example, if L - {001,10,111} and M = (e, 001}, then LuM = {e, 10,001, 111}.

2. The concatenation of languages L and M is the set of strings that can be formed by taking any string in L and concatenating it with any string in M. Recall Section 1.5.2, where we defined the concatenation of a pair of strings; one string is followed by the other to form the result of the concatenation. We denote concatenation of languages either with a dot or with no operator at all, although the concatenation operator is frequently called dot. For example, if L - {001,10,111} and M = {e,001}, then b.M, or just LM, is (001,10, 111, 001001,10001,111001}. The first three strings in LM are the strings in L concatenated with e. Since e is the identity for concatenation, the resulting strings are the same as the strings of L. However, the last three strings in LM are formed by taking each string in L and concatenating it with the second string in M, which is 001. For instance, 10 from L concatenated with 001 from M gives us 10001 for LM.



3.1.2 Building Regular Expressions

Algebras of all kinds start with some elementary expressions, usually constants and/or variables. ..Igebras then allow us to construct more expressions by

The term Kloene closure refers to S. C. Kleeno, who originated the regular expression notation and this operator.

3. The closure (or star, or Kleene closure) of a language L is denoted L* and represents the set of those strings that can be formed by taking any number of strings from L, possibly with repetitions (i.e., the same string may be selected more than once) and concatenating all of them. For instance, if L = {0,1}, then L* is all strings of Os and Is. If L = {0,11}, then L* consists of those strings of Os and Is such that the Is come in pairs, e.g., Oil, 11110, and e, but not 01011 or 101. More formally, L* is the infinite union Ui>() 1.% where L = {(), L - L, and L for / > 1 is LL---L (the concatenation of i copies of L).

Exeimple 3.1: Since the idea of the closure of a language is somewhat tricky, let us study a few examples. First, let L - {0,11}. L - {e}, independent of what language L is; the 0th power represents the selection of zero strings from L. - L, which represents the choice of one string from L. Thus, the first two terms in the expansion of L* give us {e, 0,11}.

Next, consider L. We pick two strings from L, with repetitions allowed, so there are four choices. These four selections give us - {00,011,110,1111}. Similarly, L is the set of strings that may be formed by making three choices of the two strings in L and gives us

{000,0011,0110,1100,01111,11011,11110, mill}

To compute L, we nmst compute for each i, and take the union of all these languages, has 2 members. Although each L is finite, the union of the infinite number of terms L is generally an infinite language, as it is in our example.

Now, let L be the set of all strings of Os. Note that L is infinite, unlike our previous example, which is a finite language. However, it is not hard to discover what L* is. L° - {e}, as always. L = L. is the set of strings that can be formed by taking one string of Os and concatenating it with another string of Os. The result is still a string of Os. In fact, every string of Os can be written as the concatenation of two strings of Os (dont forget that e is a string of Os ; this string can always be one of the two strings that we concatenate). Thus, - L. Likewise, L - L, and so on. Thus, the infinite union L* - L и и и - is L in the particular case that the language L the set of all strings of Os.

For a final example, 0* ~ {c}. Note that 0 - {c}, while 0% for any i > 1, is empty, since we cant select any strings from the empty set. In fact, 0 is one of only two languages whose closure is not infinite. □



Use of the Star Operator

We saw the star operator first in Section 1.5.2, where we applied it to an alphabet, e.g., 2*. That operator formed all strings whose symbols were chosen from alphabet S. The closure operator is essentially the same, although there is a subtle distinction of types.

Suppose L is the language containing strings of length 1, and for each sjTnbol о in S there is a string a in L. Then, although L and S look the same, they are of different types; L is a set of strings, and S is a set of sjTubols. On the other hand, L* denotes the same language as S*.

applying a certain set of operators to these elementary expressions and to previously constructed expressions. Usually, some method of grouping operators with their operands, such as parentheses, is required as well. For instance, the familiar arithmetic algebra starts with constants such as integers and real numbers, plus variables, and builds more complex expressions with arithmetic operators such as + and x.

The algebra of regular expressions follows this pattern, using constants and variables that denote languages, and operators for the three operations of Section 3.1.1 -union, dot, and star. We can describe the regular expressions recursively, as follows. In this definition, we not only describe what the legal regular expressions are, but for each regular expression E, we describe the language it represents, which we denote L{E).

BASIS: The basis consists of three parts:

1. The constants e and 0 are regular expressions, denoting the languages {c} and 0, respectively. That is, L{e) - {e}, and L(0) = 0.

2. If a is any symbol, then a is a regular expression. This expression denotes the language {o}. That is, L(a) - {o}. Note that we use boldface font to denote an expression corresponding to a symbol. The correspondence, e.g. that a refers to a, should be obvious.

3. A variable, usually capitalized and italic such as L, is a variable, representing any language.

INDUCTION: There are four parts to the inductive step, one for each of the three operators and one for the introduction of parentheses.

1, If E and F are regular expressions, then E + F is a regular expression denoting the union of Е{Е) and L{F). That is, LiE + F) - L{E) U L{F).

2. If .E and F are regular expressions, then EF is a regular expression denoting the concatenation of L{E) and L(F), That is, L(EF) - L{E)LiF).



1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 [ 28 ] 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175