CS 550 Fall 2003 - Program Implementation Notes.

Program 1 - Scanner

The first program was a simple scanner to pick out all the possible Ada identifiers from a file.

I created a simple deterministic finite state automaton (FSA) that recognizes Ada identifiers and implemented it in C. It obeys the following rules:

The Ada 95 language manual has the following EBNF for identifiers:
identifier ::= identifier_letter {[underline] letter_or_digit}
letter_or_digit ::= identifier_letter | digit
identifier_letter ::= upper_case_identifier_letter | lower_case_identifier_letter
digit ::= '0' | '1' | ... | '9'
underline ::= '_'
upper_case_identifier_letter ::= Any character of Row 00 of ISO 10646 BMP whose name begins 'Latin Capital Letter'
lower_case_identifier_letter ::= Any character of Row 00 of ISO 10646 BMP whose name begins 'Latin Small Letter'
For simplicity's sake, just use
upper_case_identifier_letter ::= 'A' | 'B' | ... | 'Z'
lower_case_identifier_letter ::= 'a' | 'b' | ... | 'z'

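A regular expression that approximates this EBNF is [A-Za-z][A-Za-z0-9_]* (strictly, the EBNF also rules out consecutive and trailing underscores, which would need [A-Za-z](_?[A-Za-z0-9])* instead).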

The program implements a deterministic FSA that corresponds to this regular expression.
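
As an illustration, here is a minimal sketch of a two-state DFA for the expression [A-Za-z][A-Za-z0-9_]*. The original program was written in C; this sketch uses C++11 to match the later programs, and the names ScanState and scan_identifiers are hypothetical, not taken from the original source.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical two-state DFA for [A-Za-z][A-Za-z0-9_]*.
enum class ScanState { Start, InIdentifier };

// Letters are restricted to ASCII, as in the simplified EBNF.
static bool is_letter(char c) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

static bool is_digit(char c) { return c >= '0' && c <= '9'; }

// Return every maximal match found in `text`, in order of appearance.
std::vector<std::string> scan_identifiers(const std::string& text) {
    std::vector<std::string> found;
    std::string current;
    ScanState state = ScanState::Start;
    for (char c : text) {
        if (state == ScanState::Start) {
            if (is_letter(c)) {           // the first character must be a letter
                current.assign(1, c);
                state = ScanState::InIdentifier;
            }
        } else if (is_letter(c) || is_digit(c) || c == '_') {
            current.push_back(c);         // stay in the accepting state
        } else {
            assert(!current.empty());     // invariant of InIdentifier
            found.push_back(current);
            state = ScanState::Start;
        }
    }
    if (state == ScanState::InIdentifier) found.push_back(current);
    return found;
}
```

Note that this sketch skips anything that cannot start an identifier, so a token such as 9abc yields abc; a full lexer would recognize numeric literals first.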

Program 2 - First and Follow Sets

I implemented this program using C++ and classes from the Standard Template Library (STL). The STL container classes that I used were the following:

list: To hold lists of symbols, for the right sides of productions. Also holds the list of productions for the grammar.
map: To hold the collection of terminals in the grammar, and the collection of non-terminals.
bitset: To hold sets of terminals (used to build the first and follow sets).

One interesting aspect of this program was coming up with an efficient way to implement sets in the STL. The STL does include a set container class, but it does not provide union and intersection as member operations (the generic set_union and set_intersection algorithms work on sorted ranges, but they are clumsier to use). An efficient mechanism for implementing sets, if you can put a reasonable upper bound on the number of potential members, is to use a bit map, with each bit representing a potential member. I implemented this in C++ using the STL bitset. You need to specify the number of bits in a set at compile time, so I arbitrarily chose 256. In order to use the bitset to hold sets of terminals, I needed to decide which bit represented each terminal. To keep track of this, I put all the terminals into a map - this allows an association between a key value (the terminal symbol name) and a data value (the ordinal value of that terminal - its bit number in the bitset).

Once the terminals were added to the map and assigned ordinals, the bitset provided a very good means of storing the sets. The bitwise OR (|) and AND (&) operators allowed very simple unions and intersections, and the bitset's support for standard array index notation made it easy to check the set for the presence of a particular terminal.
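
The map-plus-bitset arrangement described above might look roughly like the following sketch. The names TerminalTable, TerminalSet and add_all are assumptions made for illustration; add_all in particular is a hypothetical helper of the kind that is handy when iterating First and Follow sets to a fixed point.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

const std::size_t kMaxTerminals = 256;   // arbitrary capacity, as in the text
typedef std::bitset<kMaxTerminals> TerminalSet;

// Maps each terminal name to its bit position (hypothetical helper class).
class TerminalTable {
public:
    // Return the ordinal of `name`, assigning the next free bit on first use.
    std::size_t ordinal(const std::string& name) {
        std::map<std::string, std::size_t>::const_iterator it = ordinals_.find(name);
        if (it != ordinals_.end()) return it->second;
        assert(ordinals_.size() < kMaxTerminals);   // capacity is fixed at compile time
        const std::size_t next = ordinals_.size();
        ordinals_[name] = next;
        return next;
    }

private:
    std::map<std::string, std::size_t> ordinals_;
};

// Union `src` into `dst` and report whether `dst` gained any members.
bool add_all(TerminalSet& dst, const TerminalSet& src) {
    const TerminalSet before = dst;
    dst |= src;
    return dst != before;
}
```

With two sets a and b of this type, a | b is the union, a & b is the intersection, and a.test(n) or a[n] checks for the terminal whose ordinal is n.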

I also used a map to store the non-terminals; the key was again the non-terminal name, and the value was a structure containing the first and follow sets for the non-terminal.

Another interesting aspect of the program was the mechanism to parse the input XML files. I had not done anything with XML before, and I ended up using the expat library, which is available as a downloadable package for Linux and many other operating systems. This XML parser follows the SAX style of parsing - an event-driven model where callback functions are invoked as start and end tags are recognized in the input. This is more memory efficient than a DOM-based parser, which loads the entire document into a tree structure as it is parsed. DOM parsers have several advantages, typically including much more validation and error handling, but for the simple input format of this program, the SAX-style parser was sufficient.

The C++ implementation requires more lines of source than some of the other solutions that were programmed in languages like Perl or Python, but I think that it is probably a more maintainable implementation, and I expect that it would be more efficient (although efficiency is not of great importance in a program like this that is not run frequently on large input data sets).

After an initial learning curve, the STL proved to be very useful. Once you are acquainted with the concept of iterators and how they are used in conjunction with the container classes, it is reasonably straightforward to use the various containers.

Program 3 - LL(1) Table Generator

I just added code to the previous implementation to generate the LL(1) table, since it required the First and Follow sets and grammar rules. I also used one more container class from the STL, the vector.

Rather than creating a separate matrix indexed by non-terminal and terminal, I just added a vector of production numbers to the structure for each non-terminal (along with the first and follow set bitsets). This vector contains an entry for each terminal (indexed by the terminal's ordinal value, just like the bits in the bitsets). Each entry is actually a list of production numbers, so I can handle non-LL(1) grammars that have multiple rule numbers for some combinations of terminal and non-terminal. This data structure was easy to deal with, since I usually had to look up the entry for the non-terminal in the map anyway, to get its First or Follow set.
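
A rough sketch of that per-non-terminal structure follows. The type and field names (NonTerminalInfo, ll1_row, add_prediction, is_ll1) are hypothetical; only the overall shape, a map of records that each hold two bitsets and a vector of lists, comes from the description above.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <list>
#include <map>
#include <string>
#include <vector>

const std::size_t kMaxTerminals = 256;
typedef std::bitset<kMaxTerminals> TerminalSet;

// Per-non-terminal record (field names are assumptions for this sketch).
struct NonTerminalInfo {
    TerminalSet first;
    TerminalSet follow;
    // ll1_row[t] lists the productions predicted on the terminal with ordinal t.
    std::vector<std::list<int>> ll1_row;
};

typedef std::map<std::string, NonTerminalInfo> NonTerminalMap;

// Record that `production` is predicted for (non_terminal, terminal ordinal).
void add_prediction(NonTerminalMap& nts, const std::string& non_terminal,
                    std::size_t terminal, int production) {
    assert(terminal < kMaxTerminals);
    NonTerminalInfo& info = nts[non_terminal];
    if (info.ll1_row.size() <= terminal) info.ll1_row.resize(terminal + 1);
    info.ll1_row[terminal].push_back(production);
}

// The grammar is LL(1) only if no cell holds more than one production.
bool is_ll1(const NonTerminalMap& nts) {
    for (const auto& entry : nts)
        for (const auto& cell : entry.second.ll1_row)
            if (cell.size() > 1) return false;
    return true;
}
```

Keeping a list in every cell costs a little memory, but it means a conflict shows up as a cell with two or more entries instead of being silently overwritten.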

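The C++ implementation proved to be easy to build upon to add the LL(1) table functionality. Again, the STL containers were useful, after some initial confusion about how to get information into a vector. Don't just write vector[9] = (some structure) on a vector that holds fewer than 10 elements: operator[] does no bounds checking, so the compiler doesn't generate an error, but the write lands outside the vector's storage and the behavior is undefined. Grow the vector first with resize or push_back.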
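
A minimal sketch of one safe way to store into a vector by index is shown here. Entry and store_at are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Entry {
    int production;
};

// Store `value` at `index`, growing the vector first if it is too short.
void store_at(std::vector<Entry>& v, std::size_t index, const Entry& value) {
    if (index >= v.size()) v.resize(index + 1);   // new slots are value-initialized
    assert(index < v.size());                     // the slot now really exists
    v[index] = value;
}
```

When an out-of-range index should be reported as an error instead, v.at(index) throws std::out_of_range where v[index] would silently misbehave.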

Program 4 - CFSM Generator

I built on program 3 to add the CFSM generation. I hoped to use another STL container, the set, to store the states, so I could easily determine if a state that I generated was already in the machine. However, it turns out that the set does not provide any direct update access to its contained items (since the set stores items in order, changing the contents of an item might invalidate its position in the set's order); the set iterators are all constant iterators. I was generating the states then updating their next state links, so that ruled out the set approach.

I ended up just storing the states in a list. I generated the start state, computed the closure, and stuffed it in the list. Then I just went into a loop that examined each state in the list in turn (starting from the first) and generated its successor states. If a successor was already in the list, the loop just fixed the next state pointer to point to the existing entry; otherwise, it added the new state to the end of the list. This process was repeated until every state in the list had been examined.
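
The loop just described is a worklist over a growing list, and can be sketched as follows. This is an illustration under simplifying assumptions, not the original code: items are represented as plain integers, symbols as single characters, and the caller-supplied successors function is assumed to return already-closed item sets.

```cpp
#include <cassert>
#include <functional>
#include <list>
#include <map>
#include <set>

typedef std::set<int> ItemSet;                 // hypothetical: items as ints
typedef std::map<char, ItemSet> Transitions;   // symbol -> successor item set

struct CfsmState {
    ItemSet items;
    std::map<char, CfsmState*> next;   // filled in after the state is in the list
};

// Append every state reachable from `start` to `states`, linking successors.
void build_states(const ItemSet& start,
                  const std::function<Transitions(const ItemSet&)>& successors,
                  std::list<CfsmState>& states) {
    states.push_back(CfsmState{start, {}});
    // A std::list never relocates its elements, so iterators and pointers
    // stay valid while new states are appended during the walk.
    for (std::list<CfsmState>::iterator s = states.begin(); s != states.end(); ++s) {
        const Transitions out = successors(s->items);
        for (Transitions::const_iterator t = out.begin(); t != out.end(); ++t) {
            CfsmState* target = 0;
            for (CfsmState& existing : states) {       // linear search for a duplicate
                if (existing.items == t->second) { target = &existing; break; }
            }
            if (!target) {
                states.push_back(CfsmState{t->second, {}});
                target = &states.back();
            }
            assert(target != 0);
            s->next[t->first] = target;
        }
    }
}
```

The linear search is what the set container would have avoided; a separate map from item set to state pointer would be one way to get the fast lookup back while still keeping the states mutable.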

I shamelessly stole Ming's idea of generating HTML output for the CFSM and using anchor tags to jump between states. An actual graphical representation of the state machine that you could navigate by clicking on edges would have been cooler, but too much work at this point.

In order to make it so I could output the HTML to a separate file, I had to update all of my Print routines to be able to accept an output stream. A nice feature of C++ is the default value for parameters - I made the default stream cout, so I didn't have to change any calls except the ones that actually wanted to write to a file instead of standard output.
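
A small sketch of the default-stream technique is given below. The function names print_production and production_to_string are hypothetical; the point is only the std::ostream& parameter with std::cout as its default.

```cpp
#include <cassert>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Existing callers omit `out` and keep writing to standard output.
void print_production(const std::string& lhs, const std::vector<std::string>& rhs,
                      std::ostream& out = std::cout) {
    assert(!lhs.empty());
    out << lhs << " ->";
    for (const std::string& symbol : rhs) out << ' ' << symbol;
    out << '\n';
}

// A caller that wants the text elsewhere passes its own stream.
std::string production_to_string(const std::string& lhs,
                                 const std::vector<std::string>& rhs) {
    std::ostringstream buffer;
    print_production(lhs, rhs, buffer);
    return buffer.str();
}
```

Because std::ofstream and std::ostringstream both derive from std::ostream, the same routine can write to a file, a string, or the console without any other change.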

Another thing I discovered - if you are writing HTML, outputting the non-terminals as <non-terminal-name> is a bad idea! The browser treats anything between angle brackets as a tag and swallows it. So I added another optional parameter that says to use &lt; and &gt; instead of < and > when generating HTML.
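
The escaping fix can be sketched like this. The names html_escape and format_non_terminal, and the for_html flag, are assumptions for illustration; the sketch also escapes the ampersand, which goes slightly beyond what the text above describes.

```cpp
#include <cassert>
#include <string>

// Replace the characters that HTML treats specially with entities.
std::string html_escape(const std::string& text) {
    std::string out;
    out.reserve(text.size());
    for (char c : text) {
        switch (c) {
        case '<': out += "&lt;"; break;
        case '>': out += "&gt;"; break;
        case '&': out += "&amp;"; break;
        default:  out += c; break;
        }
    }
    return out;
}

// Format a non-terminal name; escape it only when `for_html` is set.
std::string format_non_terminal(const std::string& name, bool for_html = false) {
    assert(!name.empty());
    const std::string bracketed = "<" + name + ">";
    return for_html ? html_escape(bracketed) : bracketed;
}
```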

Program 5 - LR Parser

I just added more extensions to program 4 to create the SLR parse table. Note that I followed the suggestion in Fischer & LeBlanc and merged the action and goto tables into a single parse table. It is indexed by state and by symbol. An entry of 0 indicates a parse error. A negative value means a shift, with its absolute value being the state number. A positive value means a reduce, with the value being the production number. A shift to the accept state (the one where the augmented production gets reduced) is recognized and replaced with a special value of 9999 to indicate an accept condition.
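
The integer encoding of table entries can be sketched as a pair of encode functions and a decode function. All names here are hypothetical. The sketch also assumes that state 0 is never a shift target and that production 0 (the augmented production) is never reduced directly, since either would collide with the error value 0; likewise it assumes fewer than 9999 productions.

```cpp
#include <cassert>

const int kAcceptEntry = 9999;   // special value marking the accept condition

enum ActionKind { kError, kShift, kReduce, kAccept };

struct Action {
    ActionKind kind;
    int value;   // target state for kShift, production number for kReduce, else 0
};

// Shifts are stored as the negated state number (assumes state > 0).
int encode_shift(int state) {
    assert(state > 0);
    return -state;
}

// Reduces are stored as the production number itself (assumes production > 0).
int encode_reduce(int production) {
    assert(production > 0 && production != kAcceptEntry);
    return production;
}

// Turn a raw table entry back into a tagged action.
Action decode(int entry) {
    if (entry == 0) return Action{kError, 0};
    if (entry == kAcceptEntry) return Action{kAccept, 0};
    if (entry < 0) return Action{kShift, -entry};
    return Action{kReduce, entry};
}
```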

As was probably inevitable with tacking so many things onto the code I wrote for assignment 2, it could use some restructuring. In particular, the class hierarchy is a little messy - I used "friend" classes to swipe the internals of the grammar for the CFSM and the internals of the grammar and the CFSM for the SLR parse table construction. It should probably be reorganized using contained classes to tidy things up.

Some of the data structures could also use a little revamp - I discovered that nowhere did I have a simple list of all the symbols in the grammar. There was one for the terminals, but not for non-terminals. That makes it a little difficult to index the parse table, so I had to whip something up just for that.

I did get a parser running that uses the table - it's the simple little shift-reduce parser from the start of Chapter 6 of Fischer & LeBlanc.
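
To show how a merged table of this kind drives a parser, here is a simplified sketch. It is not the textbook's driver and not the original code: the toy grammar, the hand-built table, and every name are assumptions made for illustration. Table entries use 0 for error, a negative value for a shift or goto, a positive value for a reduce, and 9999 for accept.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy grammar (an assumption for this sketch):
//   0: S' -> S $     1: S -> ( S )     2: S -> x
enum Symbol { LParen, RParen, X, End, NtS, kSymbolCount };

const int kAccept = 9999;
const int kTable[6][kSymbolCount] = {
    //  (    )    x    $        S
    {  -2,   0,  -3,   0,      -1 },   // state 0: start
    {   0,   0,   0,   kAccept, 0 },   // state 1: S' -> S . $
    {  -2,   0,  -3,   0,      -4 },   // state 2: S -> ( . S )
    {   0,   2,   0,   2,       0 },   // state 3: S -> x .
    {   0,  -5,   0,   0,       0 },   // state 4: S -> ( S . )
    {   0,   1,   0,   1,       0 },   // state 5: S -> ( S ) .
};
const std::size_t kRhsLength[] = { 2, 3, 1 };   // right-side length per production
const Symbol kLhs[] = { NtS, NtS, NtS };        // entry 0 is a placeholder, never reduced

// Return true if `input` (made of '(', ')' and 'x') is a sentence of the grammar.
bool parse(const std::string& input) {
    std::vector<Symbol> tokens;
    for (char c : input) {
        if (c == '(') tokens.push_back(LParen);
        else if (c == ')') tokens.push_back(RParen);
        else if (c == 'x') tokens.push_back(X);
        else return false;                       // unknown character
    }
    tokens.push_back(End);

    std::vector<int> stack(1, 0);                // stack of states, start state 0
    std::size_t pos = 0;
    for (;;) {
        const int entry = kTable[stack.back()][tokens[pos]];
        if (entry == 0) return false;            // parse error
        if (entry == kAccept) return true;
        if (entry < 0) {                         // shift: push state, consume token
            stack.push_back(-entry);
            ++pos;
            continue;
        }
        // Reduce: pop one state per right-side symbol, then take the goto.
        assert(stack.size() > kRhsLength[entry]);
        stack.resize(stack.size() - kRhsLength[entry]);
        const int go = kTable[stack.back()][kLhs[entry]];
        assert(go < 0);                          // a goto must exist after a valid reduce
        stack.push_back(-go);
    }
}
```

With this table, parse("(x)") succeeds and parse("(x") fails when the end marker arrives in state 4.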

If I had more time, I would have liked to take a stab at the LALR(1) parser. Also, given the SLR(1) parse table, Fischer and LeBlanc suggest a couple more space optimizations that can be used; one in particular would have been good to implement. They talk about single reduce states - rows in the parse table whose only entries are errors or reduces by the same production for one or more lookahead tokens. These states are fairly common in the tables for many grammars. The entire row can be eliminated from the table, and any shifts to that state can be replaced by a special operation (they call it an "L") that would do the reduction immediately rather than shifting and then reducing.
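
Detecting such rows is straightforward with the entry encoding used for the parse table (0 for error, negative for shift, positive for reduce, 9999 for accept). The following is only a sketch of the detection step, with a hypothetical function name; it does not perform the row elimination or the shift rewriting.

```cpp
#include <cassert>
#include <vector>

const int kAccept = 9999;

// Return the production number if every non-error entry in `row` is a
// reduce by that same production; return 0 otherwise.
int single_reduce_production(const std::vector<int>& row) {
    int production = 0;
    for (int entry : row) {
        if (entry == 0) continue;                             // error entries are allowed
        if (entry < 0 || entry == kAccept) return 0;          // a shift or accept disqualifies
        if (production != 0 && entry != production) return 0; // two different reduces
        production = entry;
    }
    return production;
}
```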

In the end, C++ is probably not the easiest language to use to implement something like this. It is the one with which I am most familiar, though. And performance-wise, I suspect this is going to be the fastest running alternative.

Mail me at: bwall@cs.montana.edu

Last modified: Dec. 18, 2003