Строительный блокнот  Automata methods and madness 

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 [ 65 ] 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175

Char

-¥ a 1 A 1

Text

e i Char Text

e 1 Element Doc

Element

-J. Text I

<EM> Doc. </EM> 1

<P> Doc 1

<0L> List </0L> 1

Listltem

-> <LI> Doc

List

f. 1 Listltem List

Figure 5.13: Part of an HTML gi-ammar

these tags be matched by </P> and </LI> at the ends of paragraphs and hst items, but it does not require the matching. We have therefore left the matching tags off, to provide some complexity to the sample HTML grammar we shall develop. □

There are a number of classes of strings that are associated with an HTML document. We shall not try to list them all, but here are the ones essential to the understanding of text like that of Example 5.22. For each class, we shall introduce a variable with a descriptive name.

1. Text is any string of characters that can be literally interpreted; i.e., it has no tags. An example of a Text element in Fig 5.12(a) is Moldy bread.

2. Char is any string consisting of a single character that is legal in HTML text. Note that blanks are included as characters.

3. Doc represents documents, which are sequences of elements, We define elements next, and that definition is mutually recursive with the definition of a Doc.

4. Element is either a Text string, or a pair of matching tags and the document between them, or an unmatched tag followed by a document.

5. Listltem is the <LI> tag followed by a document, which is a single Hst item,

6. List is a sequence of zero or more hst items.



Figure 5.13 is a CFG that describe.? as much of the structure of the HTML language as we have covered. In line (1) it is .suggested that a character can be a or A or many other possible characters that are part of the HTML character set. Line (2) says, using two productions, that Text can be either the emjjty stiing, or any legal character followed by more text. Put another way, Text is zero or more characters. Note that < and > are not legal characters, although they can be represented by the sequences ult; and &gt;, respectively. Tbuy, wc cannot accidentally get a. tag into Text.

Line (3) says that a document is a sequence of zero or more elements. .An element in turn, we learn at line (4), is either text, an emphasized document, a paragraph-begirmiiig followed by a document, or a list. We have als(j suggested that there are other productions for Element, corresponding to the other kinds of tags that appear in HTML. Then, in line (5) we find that a list item is the <LI> tag followed by any document, and line (6) tells us that a list is a sequence of zero or more list elements.

Some aspects of HTML do not require the power uf t:(;ntej:t-free grammars; regular expressions are adequate. For example, lines (1) and (2) of Fig, 5.13 simply say that Text represents the same language as does the regular expression (a + A -h )* However, some aspects of HTML do require the power of CFGs. For instance, each pair of tags that are a corresponding beginning and ending pair, e.g., <EM> and </EM>, are like balanced parentheses, which we already know are not regular,

5.3.4 XML and Document-Type Definitions

The fact that HTML is described by a grammar is not in itself remarkable. Essentially all programming languages can be described by their own CFGs, so it would be пизге surprising if wc could not so describe HTML. However, when we look at another important markup language, XML {extensible Markup Language), we find that the CFGs play a more vital role, as part of the process of using that language.

Tlie purpose of XML is not to describe the foruiatthig of the document; that is the job for HTML. Rather, XML tries to describe the semantics of the text. For example, text like 12 .Maple St. looks like an address, but is it? In X.ML, tags would surround a phrase that represented an address; for example:

<ADDR>12 Maple St.</ADDR>

Howeer, it is not immediately obvious that <ADDR> means the addres.s of a building. For instance, if the document were about memory allocation, we might expect that the <ADDR> tag would refer to a memory address. To make clear what the different kinds of tags are, and what structures may appear between matching pairs of these tags, people with a common interest are expected to develop standards in the form of a DTD (Document-Type Definition).



A DTD is essfiiitially a context-free grammar, with its own notation for describing the variables and productions. In the next example, we shall show a simple DTD and introduce some of the language used for describing DTDs, The DTD language itself has a context-free grammar, but it is not that grammar we are interested in describing. Rather, the language for describing DTDs is essentially a CFG notation, and we want to see how CFGs are expressed in this language.

The form of a DTD is

<!DOCTYPE name-of-DTD [

list of element definitions

]>

An element definition, in turn, has the form

<!ELEMENT element-паше (description of the element)>

Element descriptions are essentially regular expressions. Tlie basis of these expressions are:

1. Other element names, representing the fact that elements of one type can appear within elements of another type, just as in HTML we might find emphasized text within a list.

2. The special term #PCDATA, standing for any text that does not involve XML tags. This term plays the role of variable Text in Example 5.22.

The allowed operators are:

1. I stjmding for union, as in the UNIX regular-expression notation discussed in Section 3.3.1.

2. A comma, denoting concatenation.

3. Three variants of the closure operator, as in Section 3.3.1. These are *, the usual operator meaning zero or more occurrences of, 4-, meaning one or more occurrences of, and ?, meaning zero or one occurrence of

Parentheses may group operators to their arguments; otherwise, the usual precedence of regular-expression operators applies.

Example 5.23 : Let us imagine that computer vendors get together to create a standard DTD that they can use to publish, on the Web, descriptions of the various PCs that they currently sell. Each description of a PC will Ьале a model number, and details about the features of the model, e.g., the amount of R.AM, number and size of disks, and so on. Figure 5.14 shows a hypothetical, very simple, DTD for personal computers.



1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 [ 65 ] 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175