chapter002 Finite State Automata, Rebular Languages, and Regular Expressions
section004Regular Expressions
Intro to Regular Expressions

Introducing Regular Expressions

Regular expressions.  Now there's a term for you.  What's so regular about them? What do they express? Are there any irregular expressions? Who came up with the term? 

Quit asking so many questions. Most of them don't have answers.  What we can say is this.

Regular expressions provide a way of expressing regular languages in a, well, regular and succint fashion.

An Example: Comparing and FSA With an Equivalent Regular Expression

Let's look at an example.  We have seen that the following finite state automaton recognizes the set of all strings that contain the substring 101 or 010.

 

This FSA provides a way of determining, or computing, whether an arbitrary binary string contains either a 101 or a 010 substring.  That is why we refer to an FSA as a model of computation.  It is a computer, or, better said, an algorithm.

An equivalent regular expression that stands for the same language that this FSA recognizes--all strings that contain either a 101 or a 010 substring--is this:

(1 | 0)* (101 | 010) (1 | 0)*

We'll be more specific about the details later, but for now just realize that the "|" sign is read "or", the "*" means "zero or more," and individual symbols that appear together, as in "101" refer to a the substring obtained by concatenating (gluing) the individual symbols together.  So, we read this regular expression as

"any mixture of zero or more 1's and 0's followed by either a 101 substring or a 010 substring followed by any mixture of zero or more 1s and 0s."

Notice again that it isn't easy to determine what the FSA computes, but (once you are used to reading regular expressions) the regular expression is pretty easy to read in determining what strings are expressed by the regular expression and which are not.

So, we can say this:

The Practical Side

Regular expressions serve a very practical purpose in computer science. They provide completely unambiguous definitions of regular languages.  For example, the phrase

"any mixture of zero or more 1's and 0's followed by either a 101 substring or a 010 substring followed by any mixture of zero or more 1s and 0s."

is ambiguous.  Surely someone will ask, "Is only one 101 substring allowed?"  "Can both a 101 substring and a 010 substring appear in a string in this language?" Good questions, becuase the English is not clear.  However, the regular expression

(1 | 0)* (101 | 010) (1 | 0)*

is completely unambiguous (at least once you know how to read regular expressions).  That's why we use regular expressions to define regular languages. For example,

After looking at this for a while we realize that the English phrase

"any mixture of zero or more 1's and 0's followed by either a 101 substring or a 010 substring followed by any mixture of zero or more 1s and 0s."

can be shortened to

"any binary string that contains a 101 or a 010 substring."

Programming Language Design

Some programming languages, like Perl, lean heavily on regular expresssions for matching patterns in strings as the fundamental programming paradigm of the language. Also, in programming language design, much of the syntax is defined by regular expressions.  For example, in some programming languages, variable names must start with a letter and have only (zero or more) letters and digits thereafter. Given the regular expressions

letter = (a | b | c | ... | x | y | z | A | B | C | ... | X | Y | Z)

digit = (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)

we can precisely specify valid variable names with this regular expression:

letter (letter | digit)*

Notice how this removes any ambiguity about what an identifier is. Without this definition, some may think that an underscore counts as a letter, or that a dash counts as a letter.  Not so given this definition of variable names by way of a regular expression.