Joan ate the ball.

All of us recognize the individual words fine (that is, we scan this string just fine). We also parse it fine; that is, we recognize that it has the right form for a sentence in English. We might wonder why Joan ate the ball, but we are not at all confused by the form of the sentence. We could translate this sentence into another language, say German, and capture the same meaning:

Joan hat den Ball gegessen.

On the other hand, consider the next string.

The ate Joan ball.

In this case, we still scan the string fine. We recognize that the individual lexemes--words--represent valid tokens (for example, definite_article, verb, noun, noun), but we can't parse this string of words. That is, we recognize immediately that there is something wrong with the form of this string--that it is not a sentence. And since there is something wrong with its form, we can't make sense of it (that is, we can't figure out its semantics--what it means). Are there words missing? Did the ball eat Joan ("Bad ball!")? Do we care any more? We certainly wouldn't try to translate it into another language: we can't capture its meaning in English, so why try in another language? Another example of a string that scans fine but does not parse is

ball the the a funny merrily a a a barf.

To see yet another twist on the situation, consider the following string:

Curious green ideas sleep furiously.

In this case we both scan and parse the string properly. It "feels" right. The adjectives, nouns, adverbs, and verbs all seem to be in the proper locations. However, the meaning eludes us. Still, because the string parses properly (has the right form), we could probably translate it into another language (and the speakers of that language could puzzle over it).
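The three failure levels above--scan, parse, and semantics--can be sketched in miniature. This is a toy illustration, not anything from a real compiler: the part-of-speech dictionary and the single hard-coded sentence pattern are assumptions made just for the example.

```python
# Toy part-of-speech dictionary: which token does each lexeme match?
POS = {
    "joan": "noun", "ball": "noun",
    "ate": "verb", "the": "definite_article",
}

# One hard-coded "grammar": a single acceptable token pattern,
# standing in for a real parser.
PATTERNS = [
    ("noun", "verb", "definite_article", "noun"),  # e.g. "Joan ate the ball"
]

def scan(text):
    """Return the token for each word, or None if some word is unknown."""
    tokens = []
    for word in text.lower().rstrip(".").split():
        if word not in POS:
            return None          # scan error: not a lexeme we recognize
        tokens.append(POS[word])
    return tokens

def parses(tokens):
    """A 'parser' in miniature: does the token string match a pattern?"""
    if tokens is None:
        return False             # never reached the parser: scan failed
    return tuple(tokens) in PATTERNS

assert parses(scan("Joan ate the ball."))       # scans and parses
assert not parses(scan("The ate Joan ball."))   # scans, but does not parse
```

Note that semantics lives outside this sketch entirely, which is precisely the point of the "Curious green ideas" example: a string can pass both checks and still mean nothing.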
A grammar G = (N, Σ, P, S) is a formal definition of a language in which N is a finite set of nonterminal symbols, Σ is a finite set of terminal symbols (which we call tokens in compiler theory), P is a set of grammar rules (sometimes called productions), and S is a special nonterminal symbol called the start symbol.

So, for an English grammar, N, the set of nonterminals, would look something like

N = {<sentence>, <subject>, <noun phrase>, <verb phrase>, <direct object>, <indirect object>, ...}

Σ, the set of terminals, would look something like

Σ = {definite_article, indefinite_article, noun, verb, adjective, adverb, comma, period, ...}

The set of lexemes would be all the words in a standard dictionary, as well as punctuation. In your study of formal languages, you did not assign lexemes to tokens, because that was not necessary for the abstract study of grammars and languages. The start symbol for this grammar is <sentence>, and the set P of grammar rules would have rules of the form

<sentence> --> <noun phrase> <verb phrase> .
<noun phrase> --> noun
<noun phrase> --> definite_article noun
<noun phrase> --> adjective <noun phrase>
<verb phrase> --> verb <direct object>
<verb phrase> --> verb <indirect object>
.
.
.
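The four-tuple G = (N, Σ, P, S) maps directly onto ordinary data structures. Here is one possible representation of the English fragment above (the representation itself--"<...>" strings for nonterminals, token names for terminals, (left side, right side) pairs for rules--is an assumption made for the example, as is spelling the "." terminal as the period token):

```python
# The English grammar fragment as plain Python data.
N = {"<sentence>", "<noun phrase>", "<verb phrase>", "<direct object>"}
Sigma = {"noun", "verb", "definite_article", "adjective", "period"}
S = "<sentence>"
P = [
    ("<sentence>",      ["<noun phrase>", "<verb phrase>", "period"]),
    ("<noun phrase>",   ["noun"]),
    ("<noun phrase>",   ["definite_article", "noun"]),
    ("<noun phrase>",   ["adjective", "<noun phrase>"]),
    ("<verb phrase>",   ["verb", "<direct object>"]),
    ("<direct object>", ["definite_article", "noun"]),
]

# Sanity check of the definition: every rule rewrites one nonterminal
# into a string of symbols drawn from N and Sigma, and S is in N.
assert S in N
assert all(lhs in N and all(sym in N | Sigma for sym in rhs)
           for lhs, rhs in P)
```

Nothing here parses anything yet; the point is only that a grammar is a finite, checkable object.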
To see whether a string such as "Joan ate the ball." is syntactically correct, one must see whether it follows the rules of the grammar. This can be done as follows:
                    <sentence>
                   /     |     \
                  /      |      \
       <noun phrase> <verb phrase> .
             |          /    \
             |         /      \
           noun     verb   <direct object>
          (Joan)    (ate)     /       \
                             /         \
                  definite_article    noun
                       (the)         (ball)
where the strings in parentheses are the lexemes corresponding to the tokens.
Basically, in this instance, the scanner would scan for individual words, then look
them up in a dictionary to see whether they were nouns, verbs, articles, or just what.
Of course, in English, some words fit more than one category (the word
"running" could be a lexeme that matches a noun, verb, or adjective
token), which is one of the
many reasons that English is so difficult to define formally with a grammar.
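The dictionary-lookup idea, including the ambiguity of a word like "running", can be sketched directly. The lexicon contents and function names below are assumptions for illustration; the essential point is that the scanner may have to report a *set* of candidate tokens for one lexeme.

```python
# Toy lexicon: each lexeme maps to the set of tokens it can stand for.
LEXICON = {
    "joan": {"noun"},
    "ate": {"verb"},
    "the": {"definite_article"},
    "ball": {"noun"},
    "running": {"noun", "verb", "adjective"},  # ambiguous lexeme
}

def scan_words(sentence):
    """Pair each word with the set of tokens it could represent."""
    return [(word, LEXICON.get(word, set()))
            for word in sentence.lower().rstrip(".").split()]
```

For "Joan ate the ball." every word yields exactly one candidate token, so the parser's job is easy; for any sentence containing "running", the scanner alone cannot decide among noun, verb, and adjective, and the decision is pushed onto the parser--one concrete way English resists formal definition.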
In contrast, a small programming language can be defined completely by a grammar:

G = (N, Σ, P, S), where

N = {<program>, <statements>, <statement>, <expression>, <id_list>, <id_list_rest>, <expression_list>, <expression_list_rest>}

Σ = {r_begin, r_end, r_read, r_write, period, eof, comma, id, assign_op, semicolon, add_op, mul_op, integer_literal, l_paren, r_paren}

S = <program>

P = {
    <program> --> r_begin <statements> r_end period
    <statements> --> <statement> <statements>
    <statements> --> λ
    <statement> --> id assign_op <expression> semicolon
    <statement> --> r_read l_paren <id_list> r_paren semicolon
    <statement> --> r_write l_paren <expression_list> r_paren semicolon
    <expression> --> <expression> add_op <expression>
    <expression> --> <expression> mul_op <expression>
    <expression> --> id
    <expression> --> integer_literal
    <expression> --> l_paren <expression> r_paren
    <id_list> --> id <id_list_rest>
    <id_list_rest> --> comma id <id_list_rest>
    <id_list_rest> --> λ
    <expression_list> --> <expression> <expression_list_rest>
    <expression_list_rest> --> comma <expression> <expression_list_rest>
    <expression_list_rest> --> λ
}
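One way to check strings against a grammar like this is a recursive-descent recognizer: one procedure per nonterminal. The sketch below is an assumed implementation, not anything prescribed by the text; note in particular that the two left-recursive <expression> rules have been rewritten as a loop, since a top-down parser cannot use left recursion directly. Input here is already a list of token names, standing in for the scanner's output.

```python
class Rejected(Exception):
    """Raised when the token string does not follow the grammar."""

class Parser:
    def __init__(self, tokens):
        self.toks = list(tokens) + ["eof"]
        self.i = 0

    def peek(self):
        return self.toks[self.i]

    def eat(self, kind):
        if self.peek() != kind:
            raise Rejected(f"expected {kind}, saw {self.peek()}")
        self.i += 1

    def program(self):                      # <program>
        self.eat("r_begin"); self.statements()
        self.eat("r_end"); self.eat("period"); self.eat("eof")

    def statements(self):                   # <statements>; the lambda rule
        while self.peek() in ("id", "r_read", "r_write"):  # means "stop"
            self.statement()

    def statement(self):                    # <statement>
        if self.peek() == "id":
            self.eat("id"); self.eat("assign_op"); self.expression()
        elif self.peek() == "r_read":
            self.eat("r_read"); self.eat("l_paren")
            self.id_list(); self.eat("r_paren")
        else:
            self.eat("r_write"); self.eat("l_paren")
            self.expression_list(); self.eat("r_paren")
        self.eat("semicolon")

    def expression(self):                   # left recursion -> iteration
        self.primary()
        while self.peek() in ("add_op", "mul_op"):
            self.eat(self.peek()); self.primary()

    def primary(self):
        if self.peek() == "l_paren":
            self.eat("l_paren"); self.expression(); self.eat("r_paren")
        elif self.peek() == "integer_literal":
            self.eat("integer_literal")
        else:
            self.eat("id")

    def id_list(self):                      # <id_list>
        self.eat("id")
        while self.peek() == "comma":
            self.eat("comma"); self.eat("id")

    def expression_list(self):              # <expression_list>
        self.expression()
        while self.peek() == "comma":
            self.eat("comma"); self.expression()

def accepts(tokens):
    try:
        Parser(tokens).program()
        return True
    except Rejected:
        return False

# Token string for a program like "begin x := 1 + 2; end."
assert accepts(["r_begin", "id", "assign_op", "integer_literal",
                "add_op", "integer_literal", "semicolon",
                "r_end", "period"])
assert not accepts(["r_begin", "r_end"])    # missing the final period
```

The loop in expression() also quietly resolves the ambiguity left by giving add_op and mul_op the same rule shape; a real grammar for this language would add precedence levels instead.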