A theory of computation course normally begins by talking about the concept of
formal languages. At first it might seem like the idea has little to do with computers, but it's probably one of the most important bits of theory in all of computer science. An understanding of formal languages is necessary for writing compilers and designing natural language processing algorithms, but even more importantly it allows us to determine what sorts of problems can be solved by computers at all. This post will cover the background of formal languages and
automata to make it easier to understand the more difficult material that may show up in a theory of computation class.
What are formal languages?
Formal languages are distinct from both programming languages and natural languages - the type of language that you and I speak. In the theory of computation, they are normally referred to simply as
languages, since it is usually clear what we mean when we mention them. A
language is a set of strings, where each string is a finite sequence of symbols drawn from an alphabet. A language can be described by a set of rules governing the structure of its strings, and it may also have an associated
automaton, which can be used equivalently to express the set of strings contained in a language. Strings in the set of a language are said to be
accepted by its automaton, while strings in the language's complement (those not included in the language) are said to be
rejected.
To illustrate what all of this means, let's start with an example.
I have an alphabet which contains the letters
a and
b. With these two
symbols, as they're called, I can construct an infinite number of strings:
a, b, aa, ab, ba, bb, aaa, aab, and so on. But I don't want to build just any string out of these symbols; I want strings that have a certain characteristic. I want only those strings that do not contain repeated adjacent symbols - no appearance of the substring
aa or
bb. If I wanted to, I could simply list all of these strings forever until the end of time:
a, b, ab, ba, aba, bab, abab, baba...
But this is really inconvenient. I would much rather define a set of rules that could be used to
generate these strings, so that I can capture the essence of my language with a finite amount of effort. I can explain my language in plain English, as I have done already, or I can use a more formal mathematical notation which is universally recognized by all computer scientists. In this case, my language is simple enough to be described using a
regular expression. My language is in the set of
regular languages.
Regular Languages
There are different classifications of languages. Regular languages are the simplest kind, consisting of rules that require no
memory beyond a fixed, finite amount to verify. In our example language, I only need to look one symbol to the left to know whether my string violates the rules. To verify a string, the idea is simple: look at each symbol, starting at the first character and ending at the last. For each one, make sure the previous character is different from the current one. If we reach the end without any violations, accept the string. Otherwise, reject it.
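Here is that verification procedure as a minimal sketch (in Python, my choice - the post itself doesn't prescribe a language):

```python
# Scan left to right and compare each symbol with the one before it,
# exactly as described above.
def no_repeats(s: str) -> bool:
    """Accept strings over {a, b} with no repeated adjacent symbols."""
    for i in range(1, len(s)):
        if s[i] == s[i - 1]:
            return False  # found a forbidden substring (aa or bb)
    return True

assert no_repeats("abab") and no_repeats("") and not no_repeats("abb")
```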
Describing this process using a regular expression ensures that there is no confusion among computer scientists about which strings my language contains. In formal notation, the regular expression for this language is the following:
$$(ab)^*a \: | \: (ba)^*b \: | \: (ab)^* \: | \: (ba)^*$$
There are several key mathematical symbols in this expression, each of which specifies some sort of operation.
Parentheses are used just as we would use them in arithmetic - they modify the order of operations by grouping symbols and other operations together, recursively. Operations within a set of parentheses must be performed first. Then, we have what's called the
Kleene star, represented as an asterisk. Symbols operated on by a Kleene star can be repeated any number of times, from zero to infinity. Finally, we have the
union symbol, represented by a vertical bar. Just like in set theory, unions in regular expressions combine two languages together to make a more general language that accepts strings in either one.
In this particular expression, there are four general possibilities. We can have a string consisting of repeating
ab substrings, ending in
a. We can have a string with repeating
ba substrings, ending in
b. Or we can have repeating
ab or
ba substrings, with no
a or
b at the end, respectively. Note that the strings
a and
b are in the language, because the substring in the parentheses can be repeated zero times, resulting in an empty string. In fact, we can just have
an empty string with nothing in it at all.
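To make this concrete, here is a quick sketch that tests the same regular expression using Python's re module (my own illustration; Python happens to use | for union and * for the Kleene star just like the formal notation):

```python
import re

# The regular expression from above, translated into Python syntax.
# fullmatch anchors the pattern so the entire string must match.
pattern = re.compile(r"(ab)*a|(ba)*b|(ab)*|(ba)*")

for s in ["a", "b", "ab", "ba", "aba", "abab", "", "aa", "abb"]:
    print(repr(s), "accepted" if pattern.fullmatch(s) else "rejected")
```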
There are other operations besides those previously described. A plus sign indicates the same thing as a Kleene star, except there must be at least one occurrence of the substring being operated on. A question mark indicates zero or one occurrence. A substring raised to the
nth power is repeated exactly
n times. Even more operations exist, but a theory of computation course will usually limit itself to the aforementioned selection and focus on the core concepts, rather than force you to memorize the notation for every possible operation.
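For reference, Python's regex dialect (my illustration again) writes these three operations as +, ?, and {n}:

```python
import re

# One or more repetitions of "ab" (the + operation).
assert re.fullmatch(r"(ab)+", "abab")
# Zero or one occurrence of "ab" (the ? operation).
assert re.fullmatch(r"(ab)?", "")
# Exactly three repetitions of "ab" (the "nth power" operation).
assert re.fullmatch(r"(ab){3}", "ababab")
```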
Now suppose we wanted to construct some kind of machine that would
recognize strings in our language. This machine would cycle through a finite set of states, moving between them every time it encounters a symbol in the input string. The machine would begin at some state and end at some other state, either accepting the string or rejecting it if the language's rules have been violated. There are several different types of machines that can perform this task. The simplest one is called a
deterministic finite automaton.
Deterministic Finite Automata
First, some etymology.
Determinism is the tendency of a system to always behave the same way given the same starting conditions and input. This is in contrast to
nondeterminism, the tendency of a system to possibly behave differently even when given the same starting conditions and input. A deterministic finite automaton is a finite state machine that is predictable because it is never given a
choice between two possibilities - when analyzing a string, it must always proceed in the same manner every time. This concept may be a bit esoteric for the time being, but we will return to it later when discussing a different type of finite automaton.
Taking our language and constructing a DFA for it, we get the following state machine diagram:
What this diagram is telling us is this: First, initialize the machine in state q0. As we encounter each new symbol in the input string, have the machine react by moving to the appropriate state. If by the time we reach the end of the string we are in an accept state (indicated by double circles), accept the string. Otherwise, reject the string.
Giving this DFA the valid input string abab, we would observe the state history q0 => q1 => q2 => q1 => q2. Because this run of the automaton ended on q2, and q2 is an accept state, it would accept the string. Giving the input string abb would result in rejection, because the automaton would halt on state q4, which is not an accept state.
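A DFA is easy to simulate as a lookup table. Below is a minimal sketch (Python); the transition table is one machine consistent with the traces above, but the q3 branch for strings beginning with b is my assumption, since the diagram may name its states differently:

```python
# A minimal DFA simulator. The table below is a guess reconstructed from
# the traces above; q4 acts as a dead (trap) state.
TRANSITIONS = {
    ("q0", "a"): "q1", ("q0", "b"): "q3",
    ("q1", "a"): "q4", ("q1", "b"): "q2",
    ("q2", "a"): "q1", ("q2", "b"): "q4",
    ("q3", "a"): "q1", ("q3", "b"): "q4",
    ("q4", "a"): "q4", ("q4", "b"): "q4",
}
ACCEPT = {"q0", "q1", "q2", "q3"}

def run_dfa(s: str) -> bool:
    state = "q0"
    for symbol in s:
        state = TRANSITIONS[(state, symbol)]  # exactly one choice each step
    return state in ACCEPT

assert run_dfa("abab") and not run_dfa("abb")
```

Notice that the simulator stores nothing but the current state, which is the point made next.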
Deterministic finite automata are only able to recognize the set of regular languages, because they have no memory. If we were to take this diagram and construct an actual physical machine capable of performing string recognition for this language, we would only need one register to store the current state of the machine, and nothing else. This is of course ignoring whatever mechanism is being used to feed the string to the machine.
Bringing back the concept of determinism, it is clear that DFAs never make a choice between a set of paths. For each symbol in the alphabet, there is only one path the DFA can take from its current state to the next. This is what makes them deterministic. But the DFA has a nondeterministic cousin, the
nondeterministic finite automaton, which is able to get around this limitation.
Nondeterministic Finite Automata
Nondeterministic finite automata introduce the concept of multiple concurrent pathways to finite state machines. While DFAs are forced to be
complete and can only have one transition per symbol in the alphabet, NFAs are not bound by these restrictions. NFAs can be
incomplete, meaning that they do not have to define a transition away from a state for every symbol in the alphabet, while DFAs are required to (
in most contexts - some theory of computation courses opt not to put this restriction on DFAs). They can also define special
epsilon transitions.
An NFA for our language might look like this:
Just like a DFA, an NFA like this one looks at one character in the string at a time from left to right, and moves between states on each encountered symbol. The epsilon symbol, seen branching four times from state q0, represents the empty string. When an NFA takes an epsilon transition, it does not consume a symbol in the string. Rather, it moves to the corresponding state without even looking at whatever character comes next.
It is also permitted to have multiple transitions on the same symbol exiting the same state - whenever this occurs, the NFA is given a choice and can take any of those transitions. The benefit of this behavior is that only one path needs to successfully recognize the string in order for it to be accepted. If the NFA takes a path that leads to rejection, it can roll itself back and try some other path instead. If no path leads to acceptance, the string is rejected.
If an NFA encounters a symbol that does not have a transition away from its current state, the machine will halt and reject the string (or roll itself back and try another path). In some theory of computation courses, this halting behavior is also granted to DFAs. In that case, the only features distinguishing an NFA from a DFA are the allowance of branching and epsilon transitions.
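One common way to simulate this "try every path" behavior without actual backtracking is to track the set of all states the NFA could be in at once. Here is a minimal sketch (Python); the tiny machine used here is a hypothetical two-state NFA for (ab)*, chosen purely for illustration, not the diagram's machine:

```python
EPSILON = ""
# Hypothetical NFA accepting (ab)*; q0 is both the start and accept state.
DELTA = {
    ("q0", "a"): {"q1"},
    ("q1", "b"): {"q0"},
}
START, ACCEPT = "q0", {"q0"}

def closure(states):
    """Expand a state set along epsilon transitions (none in this example)."""
    frontier, seen = list(states), set(states)
    while frontier:
        state = frontier.pop()
        for nxt in DELTA.get((state, EPSILON), set()) - seen:
            seen.add(nxt)
            frontier.append(nxt)
    return seen

def run_nfa(s: str) -> bool:
    current = closure({START})
    for symbol in s:
        # Follow every possible transition from every possible state.
        nxt = set()
        for state in current:
            nxt |= DELTA.get((state, symbol), set())
        current = closure(nxt)
        if not current:
            return False  # every branch has halted and rejected
    return bool(current & ACCEPT)

assert run_nfa("abab") and not run_nfa("aba")
```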
It is important to realize that there are multiple ways of building automata for the same language. This NFA has more states than the DFA we built, but it better captures the union operations in the regular expression. It is entirely possible to build an NFA with fewer states and transitions than this one. In fact, NFAs usually have fewer states than DFAs due to the reduced number of restrictions put on them.
Like DFAs, NFAs recognize only regular languages. It can be said that DFAs, NFAs, and regular expressions are equally
powerful. The more languages an automaton can recognize, the more powerful it is said to be. There are other types of languages that require more powerful automata than the ones just discussed.
The Power of Languages
What is it about certain languages that requires a more powerful automaton to recognize them, exactly? Is it that more powerful languages accept more strings? Not quite. As discussed when we were first choosing a language to implement, languages are actually restrictive on input: a trivial automaton could accept any old string passed its way precisely because it is less able to recognize particular patterns within strings. So languages that require more power are typically more selective about the strings they accept, not less.
There is an entire branch of computational theory that deals with the complexity of systems in general. Languages are included in this analysis, and have their own hierarchy - the
Chomsky hierarchy - named after the linguist and philosopher
Noam Chomsky. Below we see the structure of this hierarchy:
We have covered the smallest circle in this diagram so far. Notice that each higher classification of languages encompasses the ones below it. This means that more powerful automata are able to recognize every language that their less powerful counterparts can, and more. An automaton that is able to recognize
context-free languages is necessarily able to recognize regular languages as well. Likewise, an automaton capable of recognizing
context-sensitive languages is also capable of recognizing context-free languages, and so on.
In order to demonstrate exactly what kind of languages require these automata, let's move on to
context-free grammars.
Context-Free Grammars
Regular languages are limited in a way that context-free languages are not. A regular expression, DFA, or NFA is incapable of expressing any sort of nested, recursive structure. Context-free grammars were designed for exactly this purpose. A context-free grammar consists of a set of
production rules which replace
nonterminal symbols with
terminal symbols or other nonterminal symbols. Recursion can be accomplished by creating a production rule that replaces a nonterminal symbol with itself, plus other symbols.
Take, for example, a language which accepts strings that consist of
a repeated some number of times, followed by
b repeated that same number of times. That is, $a^nb^n$, where n is any nonnegative integer. This language is clearly not regular - a DFA or an NFA can't remember how many times it has seen the symbol
a when it comes time to count how many times
b shows up. A context-free grammar, on the other hand, has no problem performing this task. Take the set of production rules below:
$$S \rightarrow aSb$$
$$S \rightarrow \epsilon$$
The way these production rules work is simple - begin with the start symbol
S, a nonterminal symbol. Replace it with the right hand side in either rule until you are left with only terminal symbols (those in the alphabet). When replacing with epsilon, just remove the nonterminal symbol entirely. We can apply these rules to generate the string
aaabbb: S => aSb => aaSbb => aaaSbbb => aaabbb.
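Here is a small sketch (Python, my own illustration) that mechanically applies these two rules, printing each sentential form of the derivation along the way:

```python
# Apply the grammar S -> aSb | epsilon to derive a^n b^n.
def derive(n: int) -> str:
    form = "S"
    print(form)
    for _ in range(n):
        form = form.replace("S", "aSb", 1)  # rule S -> aSb
        print("=>", form)
    form = form.replace("S", "", 1)         # rule S -> epsilon
    print("=>", form)
    return form

assert derive(3) == "aaabbb"  # prints the derivation shown above
```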
This is great if we want to generate a string that requires this kind of recursive structure. But we are still in need of an appropriate recognizer for these languages. DFAs and NFAs won't do, so we need some kind of new state machine that is capable of performing recursion. This is where
pushdown automata come in.
Pushdown Automata
A pushdown automaton is essentially the same thing as an NFA, except it also has memory in the form of a stack. Stacks make recursion possible - whenever you write a recursive function in a programming language, you are stacking function calls on top of each other that eventually unwind back down to the bottom. In the case of a pushdown automaton, symbols are placed on and removed from the stack which can be used to influence the behavior of the state machine.
Here is one possible configuration of a pushdown automaton for our language:
This looks a lot like an NFA, but we have a little bit of extra notation going on. Each transition does more than just look at a character - it also does one or two stack operations (or none). The symbol to the left of the semicolon is the expected character, which is consumed in the same manner as in an NFA. Immediately to the right of the semicolon and to the left of the arrow, we identify what character we expect to pop off the stack. To the right of the arrow, we identify the character we want to push onto the stack. An epsilon placed to the left or right of the arrow indicates that nothing is popped off of or pushed onto the stack, respectively.
A PDA has two alphabets: the alphabet of the examined string that it is accepting or rejecting, and the alphabet of the stack. In this case, the string alphabet is the same as that of the regular language we were dealing with earlier - {a, b}. The stack alphabet contains two characters: a dollar symbol to indicate the bottom of the stack, and a lambda, which is pushed onto the stack for every encountered a and popped off the stack for every encountered b.
We read the diagram in this fashion: Begin at state q0. Push a dollar symbol onto the stack to indicate its bottom and move to state q1. If the string is empty, we're done - just accept it. Otherwise, if we encounter an a, push a lambda onto the stack and move to state q2. Keep doing this, remaining at the same state bubble, until we encounter a b. In that case, pop a lambda off the stack and move to state q3. Keep doing this for every b we encounter. Finally, if we reach the dollar symbol at the bottom of the stack, move to state q4. If there are no more characters in the string, accept the string. If at any point an unexpected character is encountered (in either the string or on the stack), the PDA will reject the string, just like an NFA.
All transitions are atomic - they can only take place if all of their conditions are met. For example, consider the transition between q3 and itself: if we encounter a b in the string but pop a dollar sign off the stack instead of a lambda, then we know that there are more instances of b in the string than a. Such a string must be rejected.
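Simulating this stack discipline directly makes the behavior concrete. Below is a minimal sketch (Python, my own illustration), with the five states collapsed into plain control flow:

```python
# A stripped-down simulation of the PDA above: push a marker for every
# 'a', pop one for every 'b', and require that we end exactly at the
# bottom-of-stack symbol.
def run_pda(s: str) -> bool:
    stack = ["$"]                    # q0 -> q1: mark the stack bottom
    i = 0
    while i < len(s) and s[i] == "a":
        stack.append("lambda")       # q2 loop: push a lambda per 'a'
        i += 1
    while i < len(s) and s[i] == "b":
        if stack[-1] != "lambda":
            return False             # more b's than a's
        stack.pop()                  # q3 loop: pop a lambda per 'b'
        i += 1
    # Accept only if the input is exhausted and only '$' remains (q4).
    return i == len(s) and stack == ["$"]

assert run_pda("") and run_pda("aaabbb")
assert not run_pda("abb") and not run_pda("aab") and not run_pda("ba")
```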
A standard PDA alone is not capable of recognizing all recursive structures, but it suffices in many cases. To explore its limits, let's talk about
context-sensitive grammars.
Context-Sensitive Grammars
There are some languages that require a little more information about surrounding symbols to be able to be recognized or generated. In other words, they require
context. The primary difference between context-sensitive grammars and context-free grammars is that multiple terminal or nonterminal symbols are allowed to appear on the left hand side of a context-sensitive production rule. Context-free grammars are not given this freedom; their left hand sides must each consist of a single nonterminal.
To give an example of a language that is context-sensitive, consider adding an additional requirement to the language we were dealing with before. A new symbol,
c, is added to the string alphabet, and it must also appear n times at the end of the string, giving $a^nb^nc^n$. For example,
aaabbbccc. A pushdown automaton would not be able to recognize this language, because once every instance of lambda has been popped off the stack and we reach the first
c we will have forgotten what n is.
The following production rules generate this language:
$$S \rightarrow aSBC$$
$$S \rightarrow \epsilon$$
$$CB \rightarrow CZ$$
$$CZ \rightarrow WZ$$
$$WZ \rightarrow WC$$
$$WC \rightarrow BC$$
$$aB \rightarrow ab$$
$$bB \rightarrow bb$$
$$bC \rightarrow bc$$
$$cC \rightarrow cc$$
There are quite a few things to talk about here. The first is that there is an additional stipulation placed on production rules in a context-sensitive grammar compared to an
unrestricted grammar, which is one level higher in complexity. Each production rule must actually rewrite a nonterminal symbol, introducing symbols not already present on the left hand side, rather than merely rearranging them. That is to say, the following production rule...
$$CB \rightarrow BC$$
...is not allowed, because it simply shifts the order of the two nonterminals rather than rewriting either of them. We get around this restriction by expanding it into four rules, employing a specific tactic known as
Revesz' trick. To demonstrate how this grammar works, consider the following generation of the string
aabbcc:
S
=> aSBC
=> aaSBCBC
=> aaBCBC
=> aaBCZC
=> aaBWZC
=> aaBWCC
=> aaBBCC
=> aabBCC
=> aabbCC
=> aabbcC
=> aabbcc
At each step, a single production rule transforms part of the string. The fourth sentential form, aaBCBC, results from invoking the S to epsilon rule.
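To check the mechanics, here is a small string-rewriting sketch (Python, my own). Real context-sensitive derivations are nondeterministic; this sketch fixes one convenient order - expand S the requested number of times, then apply the remaining rules leftmost and in priority order until none applies. With n = 2 it reproduces the derivation above:

```python
# A naive string-rewriting engine for the grammar above.
RULES = [
    ("CB", "CZ"), ("CZ", "WZ"), ("WZ", "WC"), ("WC", "BC"),  # Revesz' trick
    ("aB", "ab"), ("bB", "bb"), ("bC", "bc"), ("cC", "cc"),
]

def derive(n: int) -> str:
    form = "S"
    for _ in range(n):
        form = form.replace("S", "aSBC", 1)   # rule S -> aSBC
    form = form.replace("S", "", 1)           # rule S -> epsilon
    changed = True
    while changed:
        changed = False
        for lhs, rhs in RULES:                # try rules in priority order
            if lhs in form:
                form = form.replace(lhs, rhs, 1)  # leftmost occurrence
                changed = True
                break
    return form

assert derive(2) == "aabbcc" and derive(3) == "aaabbbccc"
```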
The interesting thing about recognizing context-sensitive languages is that a pushdown automaton is capable of doing it - provided it has not just one stack, but two. In fact, a PDA with two stacks is equivalent in power to a Turing machine, so additional stacks beyond a count of two do not give pushdown automata any more power, though they can reduce the number of states and transitions required. Pushdown automata with multiple stacks are abbreviated as N-PDAs, where N is some integer greater than or equal to 2. A sketch of the two-stack idea appears below; after that, I will demonstrate the usage of a different kind of recognizer: a Turing machine.
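Here the first stack counts the a's, and the second records, while the b's are read, how many c's to expect. This is one possible 2-PDA strategy, sketched in Python (my own illustration):

```python
# A two-stack recognizer for a^n b^n c^n: stack one counts the a's;
# each pop from stack one while reading b's is mirrored by a push onto
# stack two, which then counts down the c's.
def run_two_stack(s: str) -> bool:
    s1, s2 = [], []
    i = 0
    while i < len(s) and s[i] == "a":
        s1.append("x"); i += 1
    while i < len(s) and s[i] == "b":
        if not s1:
            return False             # more b's than a's
        s1.pop(); s2.append("x"); i += 1
    if s1:
        return False                 # fewer b's than a's
    while i < len(s) and s[i] == "c":
        if not s2:
            return False             # more c's than b's
        s2.pop(); i += 1
    return i == len(s) and not s2    # all input consumed, counts matched

assert run_two_stack("") and run_two_stack("aabbcc")
assert not run_two_stack("aabbc") and not run_two_stack("abcabc")
```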
Turing Machines
While you may or may not be familiar with how one works, chances are you've heard of Turing machines. They are named after the most famous computer scientist of all, Alan Turing, who aided the Allied forces during the Second World War by helping to decrypt Nazi communications. A Turing machine is capable of performing any algorithm thrown at it, and any set of machine instructions able to emulate a Turing machine is considered Turing complete. Pretty much every computer you use today is, in some sense, an emulation of a Turing machine.
A physical description of a Turing machine is required to really understand the concept. It consists of a tape which is assumed to be infinite in length, divided into discrete cells so that any number of symbols can be inscribed onto it. The tape can be moved left or right so that the symbol at any position can be examined at any time. Essentially, this is a form of random access memory, named as such because divisions of the memory can be accessed in any order. Stack memory is incapable of this, since we have to pop items off and forget them in order to reach something underneath.
When a string is fed into a Turing machine, it is printed on the tape from left to right. The Turing machine will read and overwrite parts of the tape (not necessarily limited to the cells holding the string), eventually accepting or rejecting the string it was given. A state machine diagram for a Turing machine capable of recognizing our context-sensitive language is shown below:
The machine begins at the left end of the tape, where the first character of the string is located. If the string is empty, that part of the tape will be blank - indicated by the square cup symbol. If the string is non-empty, we should encounter an a first. As with an NFA, the TM will halt and reject the string if it encounters anything unexpected. Otherwise, when we encounter that a, we overwrite it with an x.
Next, move right past all the a characters until we reach the first b, and replace it with a y. Then move to the first c and replace it with a z. Now work backwards to the last x we wrote, move one cell right, and if we find an a, repeat the whole replacement process. On later passes we will be moving right past y and z characters; just ignore them. Eventually there will come a point where an xy substring appears on the tape, which means we have consumed every a. If that is the case, we should also have consumed every b and every c by turning them into y and z. If that's all we encounter until we reach the blank at the right end of the tape, the string fits our pattern and should be accepted.
For the string aaabbbccc, we would see the following tape history:
aaabbbccc
xaabbbccc
xaaybbccc
xaaybbzcc
xxaybbzcc
xxayybzcc
xxayybzzc
xxxyybzzc
xxxyyyzzc
xxxyyyzzz
If this were the 1940s and we were still using magnetic tape, we would know that our string was accepted based on the fact that only x, y, and z characters are left on the tape, besides the blank cells. You could probably imagine doing the same thing to a string in some programming language - if so, you would be using a computer that is Turing complete, which is all of them.
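In fact, here is roughly that idea as a sketch (Python; the tape is a mutable list of characters, and the diagram's explicit states are folded into plain control flow). It is coarser-grained than the machine above - it marks one a, b, c triple per pass, so it prints only every third snapshot of the tape history shown:

```python
# Mimic the marking procedure on a list-of-characters "tape".
def run_tm(s: str) -> bool:
    tape = list(s)
    while True:
        print("".join(tape) or "(blank)")
        try:
            # Find the leftmost unmarked a, b, and c.
            i = tape.index("a")
            j = tape.index("b")
            k = tape.index("c")
        except ValueError:
            break  # some letter is exhausted; fall through to the final check
        if not (i < j < k):
            return False             # letters out of order
        tape[i], tape[j], tape[k] = "x", "y", "z"
    # Accept only if every symbol was consumed in equal numbers.
    n = len(tape) // 3
    return tape == ["x"] * n + ["y"] * n + ["z"] * n

assert run_tm("aaabbbccc") and run_tm("")
assert not run_tm("aabbbccc") and not run_tm("cba")
```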
Recursively Enumerable Languages and Beyond
By now you're probably wondering about that last bubble in the Chomsky hierarchy diagram - you know, the big one. Well, the situation is rather interesting. We started off trying to avoid having to list every string in our language one by one, because of how inconvenient that was. Recursively enumerable languages are those whose strings can be listed, one at a time, by a Turing machine - like iterating through an infinite array. A machine recognizing such a language can accept every string that belongs to it, but it may run forever on a string that doesn't, never rejecting it. So in a sense, we're back to square one.
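To illustrate what "listing strings" means, here is a small sketch (Python) that enumerates one of our earlier, decidable languages in order of length. Genuinely recursively enumerable languages need subtler dovetailing tricks, but the listing idea is the same:

```python
from itertools import count, product

# Enumerate the no-repeated-symbols language from the start of the post,
# shortest strings first.
def enumerate_language():
    for length in count(0):
        for chars in product("ab", repeat=length):
            s = "".join(chars)
            if all(x != y for x, y in zip(s, s[1:])):  # membership test
                yield s

gen = enumerate_language()
print([next(gen) for _ in range(7)])  # ['', 'a', 'b', 'ab', 'ba', 'aba', 'bab']
```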
The problem here, as I've described before, is that a string can be of any length, and a language's set of strings can therefore be infinite in size. In fact, that is true of every language we have looked at so far. Languages that lie outside the set of context-sensitive languages are problematic for subtler reasons, and they present a host of complicated issues which are usually addressed in the second half of a theory of computation course. That second half becomes less about languages and more about abstract problems, which lie outside the scope of this lesson.
So, that concludes this topic. Hopefully this was able to give you a bit more confidence as you head into one of the more theoretical courses in the typical computer science program. Good luck and thank you for reading.