Строительный блокнот  Automata methods and madness 

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 [ 56 ] 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175

the procnsk of derivntion, whereby it is determined which strings are in the language of the grammar.

5.1.1 An Informal Example

Let us consider the language of palindromes. A palindrome is a string that reads the same forward and backward, such as otto or madamimadaia ( Madam, Im Adam, allegedly the first thing Eve heard in the Garden of Eden). Put another way, string u; is a palindrome if and only if u; = w. To make things simple, we shall consider describing only the palindromes with alphabet (0,1). This language includes strings like 0110, 11011, and t, but not Oil or 0101.

It is easy to verify that the language Lpai of palindromes of Os and Is is not a regular language. To do so, we use the pumping lemma. If Lpai is a r(!gular language, let n be the; associated constant, and consider the palindrome w = 0 10 . If Lpat is regular, then we can break w into w = xyz, such that у consists of one or more Os from the first group. Thus, xz, which would also have to be in L ,; if Lpai were regular, would have fewer Os to the left of the lone 1 thtm there are to the right of the 1. Therefore xz cannot be a palindrome. We have now contradicted the assumption that Lpi is a regular language.

There is a natural, recursive definition of when a string of Os and Is is in Lpai- It starts лvith a basis saying that a few obvious strings are in Lpai, and then exploits the idea that if a string is a palindrome, it nmst begin and end with the same symbol. Further, when the first and last symbols axe removed, the resulting string must also be a palindrome. That is:

BASIS: €, 0, and 1 are palindromes.

INDUCTION: If u- is a pahndrome, so are OwO and 1 j1. No string is a palindrome of Os and Vs. unless it follows from this basis and induction rule.

.A context-free grammar is a formal notation for expressing such recursive definitions of languages. .A grammar consists of one or more variables that represent classes of strings, i.e., languages. In this example we have need for only one variable P, which represents the set of palindromes; that is the class of strings forming the language Lpai. There are rules that say how the strings in each class are constructed. The construction can use symbols of the alphabet, strings that are already known to be in one of the classes, or both.

Example 5.1: The rules that define the palindromes, expressed in the context-free grammar notation, are shown in Fig. 5,1. We shall see in Section 5.1.2 what the rules mean.

The first three rules form the basis. They tell us that the class of palindromes includes the strings e, 0, and 1. None of the right sides of these rules (the portions following the arrows) contains a variable, which is why they form a basis for the definition.

The last two ndos form the inductive part of the definition. For instance, rule 4 says that if we take any string w from the class P, then OwO is also in class F. Rule 5 likewise tells us that Iwl is also in P. □



P -

P -

0

P -

P -

OPO

P -

> IPI

Figure 5.1: A context-free grammar for palindromes

5.1.2 Definition of Context-Free Grammars

There are four important components in a grammatical description of a language:

1. There is a finite set of symbols that form the strings of the language being defined. This set was {0,1} in the palindrome example we just saw. We call this alphabet the terminals, or terminal symbols.

2. There is a finite set of variables, also called sometimes nonterminals or syntactic categories. Each variable represents a language; i.e., a set of strings. In our example above, there was only one variable, P, which we used to represent the class of palindromes over alphabet {0,1}.

3. One of the variables represents the language being defined; it is called the start symbol- Other variables represent auxiliary classc!s of strings that are used to help define the language of the start symbol. In our example, P, the only variable, is the start symbol.

4. There is a finite set of productions or rules that represent the recursive definition of a language. Each production consists of:

(a) A variable that is being (partially) defined by the production. This variable is often called the head of the production.

(b) The production symbol

(c) A string of zero or more terminals and variables. This string, called the body of the production, represents one way to form strings in the language of the variable of the head. In so doing, we leave terminals unchanged and substitute for each variable of the body any string that is known to be in the language of that variable.

We saw an example of productions in Fig. 5.1.

Che four components just described form a context-free grammar, or just grammar, or CFG. We shall represent a CFG G by its four components, that is, G = (F,r, P, S), where V is the set of variables, T the terminals, P the set of productions, and S the start symbol.



->

E + E

Figure 5.2: A context-free grammar for simple expr<;ssions

The grammar for expressions is stated formally as G = ({E,I},T,P,E), where T is the set of symbols {-h, (,), u, i, 0,1} mid P is the .set of productions shown in Fig. 5.2. Wo interpret the productions as follows.

Rule (I) is the basis rule for expressions. It says that an expression can be a single id(;ntifier. Rules (2) through (4) describe the inductive case for expressions. Rule (2) says that an e.xpression can be two expressions connected by a plus sign; rule (3) says the same with a multiplication sign. Rule (4) says

Example 5.2 : The grammar Gpai for the palindromes is represented by

Gp , = {{P},{OA}. A,P)

where A represents the set of five productions that we saw in Fig. 5.1. □

Example 5.3: Let us explore a more complex CFG that represents (a simplification of) expressions in a typical programming language. First, we shall limit ourselves to the operators -I- and *, representing addition and multiplication. We shall allow arguments to be identifiers, but instead of allowing the fufi set of typical identifiers (letters followed by zero or more letters and digits), we shall allow oidy the letters a and b and the digits 0 and 1. Ev(;ry identifier must begui with a or 6, which may be followed by any string in {a,b,0,1}.

W(; need two variables in this grammar. One, which we call E, represents expr(!s.siouH. It is the start symbol and represents the language of expressions we arc d<!fining. Tlic other variabh;, /, represents identifiers. Its language is actually regular; it is the langiuigc of the regular expression

(a + b)(a-bb + 0 + l)*

However, we shall not use regvdar (expressions directly in grammars. Rather, we use a set of productions that say essentially the same thing as this regular expnission.



1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 [ 56 ] 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175