Introducing Regular Expressions
Regular expressions. Now there's a term for you. What's so regular about them? What do they express? Are there any irregular expressions? Who came up with the term?
Quit asking so many questions. Most of them don't have answers. What we can say is this.
Regular expressions provide a way of expressing regular languages in a, well, regular and succint fashion.
An Example: Comparing and FSA With an Equivalent Regular Expression
Let's look at an example. We have seen that the following finite state automaton recognizes the set of all strings that contain the substring 101 or 010.
This FSA provides a way of determining, or computing, whether an arbitrary binary string contains either a 101 or a 010 substring. That is why we refer to an FSA as a model of computation. It is a computer, or, better said, an algorithm.
An equivalent regular expression that stands for the same language that this FSA recognizes--all strings that contain either a 101 or a 010 substring--is this:
(1 | 0)* (101 | 010) (1 | 0)*
We'll be more specific about the details later, but for now just realize that the "|" sign is read "or", the "*" means "zero or more," and individual symbols that appear together, as in "101" refer to a the substring obtained by concatenating (gluing) the individual symbols together. So, we read this regular expression as
"any mixture of zero or more 1's and 0's followed by either a 101 substring or a 010 substring followed by any mixture of zero or more 1s and 0s."
Notice again that it isn't easy to determine what the FSA computes, but (once you are used to reading regular expressions) the regular expression is pretty easy to read in determining what strings are expressed by the regular expression and which are not.
So, we can say this:
- The FSA provides a way of computing whether a given string is in the regular language or not.
- The FSA does not provide a concise, easily read description of the language it recognizes.
- The regular expression provides a concise and precise description (expression) of what the language is.
- The regular expression does not provide a way of easily computing whether a given string is in the language or not.
The Practical Side
Regular expressions serve a very practical purpose in computer science. They provide completely unambiguous definitions of regular languages. For example, the phrase
"any mixture of zero or more 1's and 0's followed by either a 101 substring or a 010 substring followed by any mixture of zero or more 1s and 0s."
is ambiguous. Surely someone will ask, "Is only one 101 substring allowed?" "Can both a 101 substring and a 010 substring appear in a string in this language?" Good questions, becuase the English is not clear. However, the regular expression
(1 | 0)* (101 | 010) (1 | 0)*
is completely unambiguous (at least once you know how to read regular expressions). That's why we use regular expressions to define regular languages. For example,
- 101 and 010 are both strings in the language represented by this regular expression; both have either a 101 or a 010 substring with zero leading and trailing 1's and 0's.
- 101010 is a string in the language represented by this regular expression. We can view this string variously as below, where each of the underlined 101 or 010 substrings shows that the whole string can be viewed as having a mixture of zero or more 1's and 0's preceding and following the underlined substring:
- 101010
- 101010
- 101010
- 101010
- 1111000111 is not in the language represented by this regular expression, because there are no 101 or 010 substrings in this string.
After looking at this for a while we realize that the English phrase
"any mixture of zero or more 1's and 0's followed by either a 101 substring or a 010 substring followed by any mixture of zero or more 1s and 0s."
can be shortened to
"any binary string that contains a 101 or a 010 substring."
Programming Language Design
Some programming languages, like Perl, lean heavily on regular expresssions for matching patterns in strings as the fundamental programming paradigm of the language. Also, in programming language design, much of the syntax is defined by regular expressions. For example, in some programming languages, variable names must start with a letter and have only (zero or more) letters and digits thereafter. Given the regular expressions
letter = (a | b | c | ... | x | y | z | A | B | C | ... | X | Y | Z)
digit = (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)
we can precisely specify valid variable names with this regular expression:
letter (letter | digit)*
Notice how this removes any ambiguity about what an identifier is. Without this definition, some may think that an underscore counts as a letter, or that a dash counts as a letter. Not so given this definition of variable names by way of a regular expression.