Next: Lexical Analysis Up: Finite Automata Previous: Languages recognized by finite automata

## Regular expressions

REGULAR EXPRESSION. Let be an alphabet such that none of the symbols (, ), |, * belongs to . A regular expression over

• is a word r over the alphabet {(,),|, * },
• denoting (= associated with) a language L(r) over and
• defined by the following inductive rules.
(R1)
The empty word is a regular expression over denoting the language {}.
(R2)
For every x the letter x is a regular expression over denoting the language {x}.
(R3)
Let r and s be regular expressions over denoting languages L(r) and L(s) respectively. Then,
• (r)|(s) is a regular expression over denoting the language L(r  L(s),
• (r)(s) is a regular expression over denoting the language L(r).L(s),
• (r) * is a regular expression over denoting the language (L(r)) * ,
• (r) is a regular expression over denoting the language L(r) (i.e., one can add parentheses).

Remark 2   Unnecessary parentheses can be avoided in regular expressions if we adopt the following conventions:
1.
the unary operator * has the highest precedence and is left associative,
2.
concatenation has the second the highest precedence and is left associative,
3.
the operator | has the lowest precedence and is left associative.
With these conventions the regular expression (a)|((b)*(c)) is a| b*c.

REGULAR LANGUAGES. Let be an alphabet (such that none of the symbols (, ), |, * belongs to it). A language over is regular if there exists a regular expression r over denoting , that is if = L(r).

It follows from Kleene's theorem that a language is recognized by FA iff it is regular.

FROM A REGULAR EXPRESSION TO A FINITE AUTOMATON. We present now a method that for any regular expression r constructs a FA (r) recognizing L(r).

• The FA (r) is built inductively by following the structure of the regular expression r.
• At each step of the algorithm a FA is produced from previous ones starting with FA of R0() and ending with (r).
• Every intermediate FA = (, S, I, F,) produced by the algorithm is NORMALIZED.
The method is quite natural.
1.
We first parse r into its constituent subexpressions.
2.
We construct NFAs for each letter or in r by applying the above rules (R1) and (R2).
3.
Then guided by the syntaxic structure of the regular expression r, we combine the NFAs using rules (R3) and Figures 15, 16, 18.
This method is called Thompson's algorithm.

Remark 3   The number of states of the FA produced by the Thompson's algorithm is upper bounded by 2| r| where | r| is the number of symbols and operators in the regular expression r.

Next: Lexical Analysis Up: Finite Automata Previous: Languages recognized by finite automata
Marc Moreno Maza
2004-12-02