.. _RegEx:
.. raw:: html
.. |--| unicode:: U+2013 .. en dash
.. |---| unicode:: U+2014 .. em dash, trimming surrounding whitespace
:trim:
.. odsalink:: DataStructures/FLA/FLA.css
.. odsalink:: AV/VisFormalLang/Regular/phiREtoNFACON.css
.. odsalink:: AV/VisFormalLang/Regular/lambdaREtoNFACON.css
.. odsalink:: AV/VisFormalLang/Regular/aREtoNFACON.css
.. odsalink:: AV/VisFormalLang/Regular/ABStarOrCCON.css
.. odsalink:: AV/VisFormalLang/Regular/GTGExampleCON.css
.. odsalink:: AV/VisFormalLang/Regular/REtoMinimizedDFACON.css
.. odsalink:: AV/VisFormalLang/Regular/RegExGTGCON.css
.. odsalink:: AV/VisFormalLang/Regular/RegExGTG3sCON.css
.. odsalink:: AV/VisFormalLang/Regular/RegExConvertCON.css
.. odsalink:: AV/VisFormalLang/Regular/NFAtoRECON.css
.. This file is part of the OpenDSA eTextbook project. See
.. http://algoviz.org/OpenDSA for more details.
.. Copyright (c) 2012-2016 by the OpenDSA Project Contributors, and
.. distributed under an MIT open source license.
.. avmetadata::
:author: Susan Rodger and Cliff Shaffer
:requires:
:satisfies: Regular Expression
:topic: Finite Automata
Regular Expressions
===================
Regular Expressions
-------------------
**Regular expression** (RegEx or R.E.) is a way to specify a set of
strings that defines a language.
There are three operators that can be used:
* :math:`+` union (OR)
* :math:`\cdot` concatenation (AND)
* :math:`*` star-closure (repeat 0 or more times)
We often omit showing the :math:`\cdot` for concatenation.
| Example:
| :math:`(a + b)^* \cdot a \cdot (a + b)^* = (a + b)^*a(a + b)^*`
| Q: What language is this?
| A: All strings from :math:`\{a, b\}^*` that contain at least one
:math:`a`.
| Example:
| :math:`(aa)*`
| Q: What language is this?
| A: Strings of :math:`a` 's with an even number of :math:`a` 's.
| Note that we need to be careful about our alphabet.
Here, we only want strings of :math:`a` 's regardless of the
actual alphabet. Written as a RegEx, this is clear regardless of
whether the actual alphabet contains other letters (such as b's).
But its not so clear in English when we say "strings
with an even number of a's" if we want to rule out "aabaa" from
the alphabet of :math:`\Sigma = \{a, b\}` or not.
|
**Definition:** Given :math:`\Sigma`,
#. :math:`\emptyset`, :math:`\lambda`, and :math:`a \in \Sigma` are R.E.
#. If :math:`r` and :math:`s` are regular expressions, then
* :math:`r + s` is a R.E.
* :math:`r s` is a R.E.
* :math:`(r)` is a R.E.
* :math:`r^*` is a R.E.
#. :math:`r` is a R.E if and only if it can be derived from (1) with
a finite number of applications of (2).
**Definition:** :math:`L(r)` is the language denoted by regular
expression :math:`r`.
#. :math:`\emptyset`, :math:`\{\lambda\}`, and :math:`\{a \in \Sigma\}` are each languages denoted by some R.E.
Note that :math:`\emptyset = \{\}` (the empty set),
while :math:`\lambda = \{ \lambda \}`,
meaning the set containing just the empty string.
Q: Does every regular language include the empty string?
#. If :math:`r` and :math:`s` are R.E. then
* :math:`L(r + s) = L(r) \cup L(s)`
* :math:`L(r s) = L(r) \cdot L(s)`
* :math:`L((r)) = L(r)`
* :math:`L((r)^*) = (L(r)^*)`
Precedence Rules
~~~~~~~~~~~~~~~~
* :math:`*` highest
* :math:`\cdot`
* :math:`+` lowest
Example: :math:`ab^* + c = (a(b)^*) + c`
Examples of Regular Expressions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#. :math:`\Sigma = \{a,b\}`,
:math:`\{w \in {\Sigma}^{*} \mid w`
has an odd number of :math:`a` 's followed by an even number of
:math:`b` 's :math:`\}`.
:math:`(aa)^{*}a(bb)^{*}`
Q: Does this language include the empty string?
#. :math:`\Sigma=\{a,b\}`, :math:`\{w \in {\Sigma}^{*} \mid w` has no more than
three :math:`a` 's and must end in :math:`ab\}`.
:math:`b^{*}(ab^{*} + ab^{*}ab^{*} + \lambda)ab`
#. Regular expression for positive and negative integers
:math:`0 + (- + \lambda)((1+2+\ldots +9)(0+1+2+\ldots +9)^{*})`
Q: What is acceptable for this language, and what is not acceptable?
Q: Can every finite set of strings be described by a R.E.?
You should be able to answer this question.
Now that we have defined what regular expressions are, a good question
to ask is: Why do we need them?
In particular, we already know two ways to define a language.
One is conceptually, as an English description with more or less
mathematical precision.
The other is operationally in the form of a DFA (or equivalently, an
NFA).
So why yet another representation?
A good answer is that the other two representations are deficient
(in different ways) that the regular expression overcomes.
Describing it in English is imprecise.
Even if we use math to make it precise, its something that we cannot
easily operationalize.
On the other hand, defining a DFA (or NFA) is a bit time consuming.
We can use a tool like JFLAP or OpenFLAP, but that takes a relatively
long time to work through the GUI, even if it is a relatively small
machine.
In contrast, we can type out a regular expression within a system like
JFLAP.
In that case, it is both fast to type and operationalizeable
(in the sense that we can then convert the RegEx to a DFA, which
implements the acceptor for the regular expression).
Of course, while a program (or machine) can be shorter or longer,
it might be hard for us to come up with the program.
In the same way, we might have to struggle to come up with the regular
expression.
But its probably short to type once we have it.
Regular Expressions vs. Regular Languages
-----------------------------------------
Recall that we **define** the term :term:`regular language` to mean
the languages that are recognized by a DFA.
(Which we know is the same as the languages recognized by an NFA,
because we know that every NFA can be converted to a DFA.)
How do regular expressions relate to these?
Are they the same languages?
Is one a subset of the other?
Or are they just different collections of languages?
We can easily see NFAs for :math:`\emptyset`, :math:`\lambda`, and
:math:`a \in \Sigma`.
Here is an NFA that accepts nothing (:math:`\emptyset`).
.. inlineav:: phiREtoNFACON dgm
:output: show
Here is an NFA that accepts an empty string (:math:`\lambda`).
.. inlineav:: lambdaREtoNFACON dgm
:output: show
Here is an NFA that accepts :math:`a \in \Sigma`.
.. inlineav:: aREtoNFACON dgm
:output: show
But what about the "more interesting" regular expressions that are
built from AND, OR, and concatenation?
Do these all have maching NFAs?
If we could find a way to "simulate" each of these operations with an
NFA, then we know that we can construct a machine for any R.E.
This idea of "simulation" is a standard approach to proving such things!
Suppose that :math:`r` and :math:`s` are R.E. (By induction...)
That means that there is an NFA for :math:`r` and an NFA for
:math:`s`.
To help us visualize such things, it helps if we can have a standard
way to draw the idea of an arbitrary NFA.
And since we want to combine machines together, it will be much easier
if we know that the arbitrary machine has one start state and one
final state.
Well, we already know that all NFAs have a single start state.
But not all NFAs have a single final state.
.. Note::
Consider any NFA, and its various final states.
Is there an easy way to convert this to an **equivalent NFA** with a
**single final state**?
The answer is "yes", by adding a new state that will be the final
state for the machine.
Figure out for yourself how you can do this.
The following slideshow shows how to convert an NFA with multiple
final states to one with a single final state.
.. inlineav:: schematicRepCON ss
:points: 0.0
:required: False
:threshold: 1.0
:id: 184203
:output: show
Our next step is to show how, for each R.E. operator, we can build an
NFA that implements the behavior of that operator on its operand NFAs.
OR: :math:`r + s`.
This means that we have NFAs :math:`r` and :math:`s`, and we want a
new NFA that OR's them together.
Simply add a new start state and a new final state,
each connected (in parallel) with :math:`\lambda`
transitions to both :math:`r` and :math:`s`.
.. inlineav:: schematicORRepCON ss
:points: 0.0
:required: False
:threshold: 1.0
:id: 184204
:output: show
:math:`r \cdot s`. Add new start state and new final state,
and connect them with :math:`\lambda` transitions in series.
.. inlineav:: schematicConcatRepCON ss
:points: 0.0
:required: False
:threshold: 1.0
:id: 184205
:output: show
:math:`r^*`. Add new start and final states, along with
:math:`\lambda` transitions that allow free movement between
them all.
.. inlineav:: schematicStarRepCON ss
:points: 0.0
:required: False
:threshold: 1.0
:id: 184206
:output: show
Create a slideshow that shows how we create a machine that
implements star closure, like Linz Figure 3.5
**Example:** :math:`ab^* + c` (Multiple Final Stage)
Here is an NFA that accepts :math:`ab^* + c`
.. inlineav:: ABStarOrCCON dgm
:output: show
Since we have the NFA that accepts :math:`ab^* + c`, we can convert it to a DFA
then to a minimized DFA.
.. inlineav:: RENFAtoDFACON ss
:points: 0.0
:required: False
:threshold: 1.0
:id: 184208
:output: show
.. note::
At this point, you should go to OpenFLAP and try it for yourself.
Type in the R.E, then convert it to an NFA,
then convert the NFA to a DFA, then minimize the DFA.
You should notice that when OpenFLAP automatically converts the
R.E. to a NFA, the resulting NFA does not look like the "intuitive"
version in the diagram above.
This is because the automatic process is a little more complicated.
To understand how an algorithm can automatically convert an R.E. to a
NFA, a lot of the steps are simply building the machine with the
transformations in the diagrams shown earlier in this module |---|
such as combining two machines to OR them or to AND them, etc.
**Definition:** A Generalized Transition Graph (GTG) is a transition
graph whose edges can be labeled with any regular expression.
Thus, it "generalizes" the standard transition graph.
.. inlineav:: GTGExampleCON dgm
:output: show
The process for automatically converting from a R.E. to an NFA simply
moves step by step through the R.E. from the lowest precedence
operators (OR) to break the R.E. down into partial machines that are
combined together.
It is fairly simple process, as seen here.
.. inlineav:: GTGtoNFACON ss
:points: 0.0
:required: False
:threshold: 1.0
:id: 184210
:output: show
One thing that this example should make clear is that the concept of
an NFA is really helpful for our understanding.
While every NFA **can** be replaced by an equivalent DFA,
it is a lot easier to understand instuitively the process of
converting an R.E. to an NFA than it would be if we had
come up with the DFA directly.
Finally, here is a slideshow that presents all of the details that an
automated process would go through to convert an R.E. to a minimized DFA.
.. inlineav:: REtoMinimizedDFACON ss
:points: 0.0
:required: False
:threshold: 1.0
:id: 184211
:output: show
Converting Regular Languages to Regular Expressions
---------------------------------------------------
Since every regular expression has an NFA that implements it,
this means that the regular expressions are a subset of
the regular languages.
The next question is: Does every regular language have a regular
expression?
**Theorem:** Let :math:`L` be regular.
Then there exists an R.E. such that :math:`L = L(r)`.
Perhaps you thought it fairly intuitive to see that any regular
expression can be implemented as a NFA, as described above.
But for most of us, its not obvious that any NFA can be converted to a
regular expression.
This proof is rather difficult, and we are just going to give a sketch.
| Proof Idea:
| Use a process that removes states sucessively,
generating equivalent generalized transition graphs (GTG) until
only two states are left (the initial state and one final state),
with the resulting regular expression as the transition.
| This regular expression left as the sole transition is equivalent
to the original NFA.
**Definition:** A complete GTG is a complete graph, meaning that every
state has a transition to every other state.
Any GTG can be converted to a complete GTG by adding edges labeled
:math:`\emptyset` as needed.
**Proof:**
:math:`L` is regular :math:`\Rightarrow \exists` NFA :math:`M` such
that :math:`L = L(M)`.
#. Assume :math:`M` has one final state, and :math:`q_0 \notin F`.
#. Convert :math:`M` to a complete GTG.
Let :math:`r_{ij}` stand for the label of the edge from :math:`q_i`
to :math:`q_j`.
#. If the GTG has only two states, then it has this form:
.. inlineav:: RegExGTGCON dgm
:align: center
Add an arrow to the start state. Then, the corresponding regular
expression is:
:math:`r = (r^*_{ii}r_{ij}r^*_{jj}r_{ji})^*r^*_{ii}r_{ij}r^*_{jj}`
Of course, we might have a machine with its start state also a
final state.
There are two ways to deal with this.
One is to come up with a rule in this case.
(Hint: Its the same rule, with an extra "OR" added for the case
where we stay in the start state.)
The other is to first convert our NFA to one with a single final
state (separate from the start state).
This is really easy to do, and is probably a homework problem for
the class.
#. If the GTG has three states, then it must have the following form:
.. inlineav:: RegExGTG3sCON dgm
:align: center
In this case, make the following replacements:
.. math::
\begin{array}{lll}
REPLACE & \ \ \ \ \ \ \ \ & WITH \\ \hline
r_{ii} && r_{ii}+r_{ik}r_{kk}^{*}r_{ki} \\
r_{jj} && r_{jj}+r_{jk}r_{kk}^{*}r_{kj} \\
r_{ij} && r_{ij}+r_{ik}r_{kk}^{*}r_{kj} \\
r_{ji} && $r_{ji}+r_{jk}r_{kk}^{*}r_{ki} \\
\end{array}
After these replacements, remove state :math:`q_k` and its edges.
#. If the GTG has four or more states, pick any state :math:`q_k` that
is not the start or the final state.
It will be removed.
For all :math:`o \neq k, p \neq k`, replace :math:`r_{op}` with
:math:`r_{op} + r_{ok}r^*_{kk}r_{kp}`.
When done, remove :math:`q_k` and all its edges.
Continue eliminating states until only two states are left.
Finish with step (3).
#. In each step, we can simplify regular expressions :math:`r` and
:math:`s` with any of these rules that apply:
| :math:`r + r = r` (OR a subset with itself is the same subset)
| :math:`s + r{}^{*}s = r{}^{*}s` (OR a subset with a bigger subset
is just the bigger subset)
| :math:`r + \emptyset = r` (OR a subset with the empty set is just
the subset)
| :math:`r\emptyset = \emptyset` (Intersect a subset with the empty
set yields the empty set)
| :math:`\emptyset^{*} = \{\lambda\}` (Special case)
| :math:`r\lambda = r` (Traversing a R.E. and then doing a free
transition is just the same R.E.)
| :math:`(\lambda + r)^{*} = r^{*}` (Taking a free transition adds nothing.)
| :math:`(\lambda + r)r^{*} = r^{*}` (Being able to do an option
extra :math:`r` adds nothing)
And similar rules.
.. inlineav:: RegExConvertCON dgm
:align: center
You should convince yourself that, in this image, the right side is a
proper re-creation of the left side.
In other words, the R.E labeling the self-loop for the left state
in the right machine is correctly characterizing all the ways that one
can remain in state :math:`q_0` of the left machine.
Likewise, the R.E. labeling the edge from the left state to the right
state in the machine on the right is correctly characterizing all the
ways that one can go from :math:`q_0` to :math:`q_2` in the machine on
the right.
We have now demonstrated that regular expressions are equivalent to DFAs.
Meaning that given any regular expression, we have an algorithm to
convert it to some DFA.
And vice versa.
.. inlineav:: NFAtoRECON ss
:points: 0.0
:required: False
:threshold: 1.0
:id: 184212
:output: show
.. odsascript:: DataStructures/FLA/FA.js
.. odsascript:: AV/VisFormalLang/Regular/phiREtoNFACON.js
.. odsascript:: AV/VisFormalLang/Regular/lambdaREtoNFACON.js
.. odsascript:: AV/VisFormalLang/Regular/aREtoNFACON.js
.. odsascript:: AV/VisFormalLang/Regular/schematicRepCON.js
.. odsascript:: AV/VisFormalLang/Regular/schematicORRepCON.js
.. odsascript:: AV/VisFormalLang/Regular/schematicConcatRepCON.js
.. odsascript:: AV/VisFormalLang/Regular/schematicStarRepCON.js
.. odsascript:: AV/VisFormalLang/Regular/ABStarOrCCON.js
.. odsascript:: AV/VisFormalLang/Regular/RENFAtoDFACON.js
.. odsascript:: AV/VisFormalLang/Regular/GTGExampleCON.js
.. odsascript:: AV/VisFormalLang/Regular/GTGtoNFACON.js
.. odsascript:: AV/VisFormalLang/Regular/REtoMinimizedDFACON.js
.. odsascript:: lib/paper-core.min.js
.. odsascript:: DataStructures/FLA/REtoFAController.js
.. odsascript:: lib/underscore.js
.. odsascript:: DataStructures/FLA/Discretizer.js
.. odsascript:: AV/VisFormalLang/Regular/RegExGTGCON.js
.. odsascript:: AV/VisFormalLang/Regular/RegExGTG3sCON.js
.. odsascript:: AV/VisFormalLang/Regular/RegExConvertCON.js
.. odsascript:: AV/VisFormalLang/Regular/NFAtoRECON.js