Saturday, December 31, 2016

MIPS: Millions of Instructions Per Second (Part 1)

Programming in an assembly language is an arduous task, but it is definitely rewarding.  Knowing what is happening at the hardware level of a computer gives you a sense of the intricacy of one of history's most complicated machines.  It will also make you appreciate higher-level languages and never want to live without them again.

Luckily, you are reading this blog because you are learning MIPS, one of the easier assembly languages to pick up.  MIPS is a RISC (reduced instruction set computer) architecture.  Everything that you program in MIPS is some combination of very simple commands, such as "add these two numbers together" or "jump to this instruction."  More complex instruction sets exist - such as x86 - that include specialized extensions.  These extensions improve the speed and efficiency of a program, but require more knowledge to use effectively.

 

How an Assembly Language Works

 

An assembly language is a specific type of programming language with minimal abstraction between the hardware and code.  It is defined by its instruction set - a list of commands to the computer and how they function.  Each instruction assembles to some number of machine instructions, which exist in a binary format that the computer can actually use to execute a program.  Typically there is a one-to-one mapping between an instruction and an associated machine instruction, but there exist pseudo-instructions which assemble to multiple machine instructions that must be executed consecutively.
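
For example, loading a full 32-bit constant is a pseudo-instruction - no single machine instruction has room for a 32-bit immediate - so the assembler expands it into two.  A sketch (the exact expansion can vary between assemblers):

    li $t0, 0x12345678       # one pseudo-instruction...

    lui $at, 0x1234          # ...assembles to two machine instructions:
    ori $t0, $at, 0x5678     # load the upper half, then OR in the lower half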

Code written in an assembly language has to be transformed into machine code in order to be executed.  We call this transformation assembly, hence the name given to this category of programming languages.  Much, much more happens past this level, but this post will focus exclusively on the programming aspect of MIPS.  Typically an instruction set will have an assembler maintained by the company that manufactures the chips implementing that instruction set.  However, instead of using such an assembler we will be using an integrated development environment called MARS, which simulates a MIPS processor.

 

MARS IDE

 

MARS is developed and maintained by Missouri State University and can be downloaded here.  It requires Java to run, so make sure you have that installed.  When you first open MARS you will be treated to a screen that looks something like this:


A quick rundown of what you're seeing: The area at the right shows the current contents of each register, and is very useful for debugging.  Registers are small units of very high-speed storage found directly within a MIPS processor (or any other type of processor).  All instructions that do any sort of calculation must do so on registers.  MIPS has 32 registers accessible to the programmer - each one functions exactly the same in hardware terms, but from a software standpoint each has a specific purpose and should be used according to a standardized convention.  That's not important quite yet, but we'll get to it.

The area at the bottom is the console and functions basically the same way as in Eclipse or NetBeans.  There are two tabs - the first will tell you about any problems the assembler has with your code, and the second is the standard output of your program.  Finally, the area in the middle is of course where you'll be writing your code.  Everything else is pretty self-explanatory - feel free to explore a bit more before we begin.

 

Your First Program

 

Create a new file by going to File > New.  Save your file and name it anything you want.

Getting right into it, there are two main areas of your program: the .text section, and the .data section.  The .text section is where your code will go, while the .data section is where you can declare variables stored in memory.  We'll just stick to the .text section for now, but go ahead and create both sections by writing .text at the top of your file, and .data some number of lines beneath, like this:


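In plain text, that skeleton is simply:

    .text


    .data
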
The MARS editor will highlight and recognize these as directives, special commands that the assembler will deal with when building your code.  In this case, the .text and .data directives indicate what sort of content the assembler should expect next.

Below the .text directive, write the following lines:


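    li $v0, 10      # load the number of the exit syscall into $v0
    syscall         # perform the syscall
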
Both of these lines are instructions.  The first line is a load immediate.  All this does is set the value in the specified register, in this case $v0, to some number.  An immediate is some constant number written in code, as opposed to a variable stored in a register.  Here our immediate is 10.

The second line is exactly what it says it is: a syscall.  You can think of a syscall as a special function that performs some action that is important to the system using the MIPS processor - printing a number to the console, for example, or in our case exiting the program.  The value in $v0 selects which syscall is performed; 10 is the exit syscall, which is why we loaded it in the previous line.

You can go ahead and build the program by clicking the icon on the toolbar at the top.  This will automatically switch you to the execute tab and away from the editor.  The middle of MARS will look like this now:


Just like your program, the execute tab is split into a text segment and a data segment.  The text segment shows the raw instructions that your program assembled to, while the data segment gives you a view of every location in main memory.  In the text segment you can set breakpoints, which will pause your program while it is running so that you can see what state the machine is in.  This will be extremely useful when (not if) you run into a bug that you can't catch just by looking at the code.  Be prepared to use this feature.  A lot.

There's also the program arguments field.  If you've programmed in C or C++ before, this is the same thing as the command line.  If you haven't, don't worry - we'll cover what this does later.

Now go ahead and run your program by clicking the button on the toolbar.  Not much will happen, since all we've done is told the program to exit as soon as it starts.  You'll see the following printed out in the console:


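    -- program is finished running --
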
Were we not to exit the program using the syscall, you would see this instead:


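    -- program is finished running (dropped off bottom) --
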
The reason there's a distinction between just letting the program reach the end and manually exiting is that once you start writing functions at the bottom of your file, a program that runs off the end will fall straight through them, probably causing errors.  Using the exit syscall prevents this from happening.

 

Hello World

 

Alright, now we're past all the boring stuff.  Let's write some actual code.  In your .data section, allocate a string variable hello_world:


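    hello_world: .asciiz "Hello World!"    # the exact text is up to you
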
Every variable in your data section has a label.  Our string has the label hello_world, and can be used in the text section by referring to that label.  Any time a variable is declared in the data section, it has to have a type directive.  An .asciiz is a null-terminated ASCII string.  A null-terminated string is any string that is stored with a null terminator as its last character.  A null terminator is just the value zero, also called null, and is used as a way of denoting where the string stops in memory.  ASCII is one way of encoding letters and symbols using binary numbers.  It was designed specifically for the English language, and is only capable of representing 128 different symbols (256 in its extended variants).  Other encodings exist that can represent pretty much any symbol used to communicate, but they are vastly more complicated.

Like all other hello world programs, our goal is to print our string to the console.  This is rather simple:


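    la $a0, hello_world    # load the address of our string into $a0
    li $v0, 4              # syscall 4 prints the string at the address in $a0
    syscall

    li $v0, 10             # and exit as before
    syscall
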
This is yet another syscall.  The first instruction, la (load address), loads the address of our string into the register $a0.  Our string exists in memory at a certain location - the address of our string is that location.  We often refer to addresses as pointers, because they point to something in memory.  Loading the immediate value 4 into $v0 specifies the type of syscall we want.  Syscall 4 prints the string located at the address in $a0 to the console.

Running this program, you will get the following output:


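    Hello World!
    -- program is finished running --
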
As expected, our string appears in the console.

 

Loops

 

Now let's do the same thing, but some variable number of times.


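Here is one version consistent with the description that follows (the loop count of 5 and the choice of registers are arbitrary):

    .text
    li $t0, 0              # counter
    li $t1, 5              # number of times to loop

    loop:
    beq $t0, $t1, exit     # leave the loop once the counter reaches the limit

    la $a0, hello_world    # print our string, as before
    li $v0, 4
    syscall

    addi $t0, $t0, 1       # increment the counter
    b loop                 # go back to the top of the loop

    exit:
    li $v0, 10             # exit the program
    syscall

    .data
    hello_world: .asciiz "Hello World!\n"
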
We start off by loading two values into two separate registers.  The meaning of the value stored in each register is noted by the comments, which begin with the pound character (#).  After the loads comes the label loop.  Labels in the text segment can be referred to by instructions - this enables us to jump to some point in the code from some other point in the code, without having to specify a line number.

The next line is a branch on equal instruction.  This instruction compares two registers and jumps to the given label if they have the same value.  There are several different branching instructions that check for different logical relationships between two values.

After the print syscall comes an add immediate (addi) instruction.  This instruction takes three arguments: two registers and an immediate value.  The first register is the destination register - where the result is stored.  The second register is the source, and the immediate is how much we're adding to it.  In this case and in most cases, we are incrementing a register with this instruction - the destination and source register arguments are therefore the same, $t0.

The last line of the loop is a branch instruction.  A plain branch instruction takes no register arguments and always jumps to the label it's given.  It functions in exactly the same way as a jump instruction (j).  The difference between the two is unimportant for a programming guide.

This program is equivalent to the Java program below:

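Something like this (variable names chosen for illustration):

    int i = 0;
    int n = 5;
    while (i != n) {
        System.out.println("Hello World!");
        i++;
    }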
 

Factorial Program

 

Let's use loops to perform some sort of calculation.


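Again, one version consistent with the description below, computing 5!:

    .text
    li $t0, 1              # counter
    li $t1, 5              # n - the number whose factorial we want
    li $t2, 1              # running product

    loop:
    bgt $t0, $t1, exit     # leave the loop once the counter passes n
    mult $t2, $t0          # multiply the product by the counter
    mflo $t2               # copy the lower 32 bits of the result into $t2
    addi $t0, $t0, 1       # increment the counter
    b loop

    exit:
    move $a0, $t2          # copy the result into $a0
    li $v0, 1              # syscall 1 prints the integer in $a0
    syscall
    li $v0, 10             # exit
    syscall
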
Some new things: the branch at the top of the loop is now a bgt instruction, which means branch on greater than.  This should be pretty self-explanatory.  The mult and mflo lines work together to multiply $t2 by $t0 and load the result back into $t2.  The mult instruction is a bit different in how it works compared to addi.  There are two auxiliary registers, $lo and $hi, that together store a contiguous 64-bit result.  Each register in MIPS is 32 bits, but multiplication can produce a larger number pretty quickly, so twice as much space is set aside.  We're assuming that our number is small enough to fit in 32 bits, so we just load the lower half with the mflo instruction, which copies from $lo into some other register.  After the loop, the move instruction copies the value in the second register argument to the first register argument (from $t2 to $a0 in our case).

Compare to the Java equivalent:

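Something like this:

    int i = 1;
    int n = 5;
    int result = 1;
    while (i <= n) {
        result *= i;
        i++;
    }
    System.out.println(result);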
 

Loop Style

 

Now that we've seen a couple example programs with loops, it might be a good time to talk about the way loops are normally formatted.  In MIPS assembly, everything is a while loop.  If you need to have a counter, you must initialize it before the loop begins.  At the very top of the loop, you have some label which indicates the beginning of the loop.  Then you have some condition (or set of conditions) that breaks the loop when it becomes true.  At the bottom of the loop you have a branch or jump instruction that goes to the top of the loop.  If you are emulating a for loop, then you increment your counter right before you branch to the top.  In other words:


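    li $t0, 0              # initialize the counter before the loop

    loop:
    bge $t0, 10, exit      # break the loop when the condition becomes true
                           # (bge: branch if $t0 >= 10)

    # ... loop body goes here ...

    addi $t0, $t0, 1       # increment the counter (if emulating a for loop)
    b loop                 # back to the top

    exit:
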
This sort of pattern will show up frequently in your programs.  One extra thing to note - branch instructions accept either registers or immediates as their second argument.  So while in the previous programs I loaded a value into a register to compare to, I could just as easily have used an immediate value instead, as I have here.  This is something unique to branch instructions.

 

Non-immediate Instructions

 

We have been using addi to do all of our addition so far, but there are times when you might need to add two variables together.  Immediates cannot be changed at runtime, so the second variable has to be read from another register.  The add instruction does this, and is the instruction that addi is based on.  You can think of immediate instructions as slight alterations of non-immediate instructions.  The non-immediate instructions were designed first, and their immediate versions added as an additional consideration.



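    add $t0, $t1, $t2      # $t0 = $t1 + $t2
    sub $t3, $t4, $t5      # $t3 = $t4 - $t5
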
The add instruction stores the result of $t1 + $t2 in register $t0, the first argument.  The sub instruction does exactly the same thing, except it subtracts (in this case storing $t4 - $t5 in $t3).  Other instructions like this exist, which I will address in later installments.

 

To Be Continued

 

In the next part, we will cover if statements and memory operations.

Thursday, December 29, 2016

Colorless Green Ideas Sleep Furiously

A theory of computation course normally begins by talking about the concept of formal languages.  At first it might seem like the idea has little to do with computers, but it's probably one of the most important bits of theory in all of computer science.  An understanding of formal languages is necessary for writing compilers and designing natural language processing algorithms, but even more importantly it allows us to determine what sorts of problems can be solved by computers at all.  This post will cover the background of formal languages and automata to make it easier to understand the more difficult material that may show up in a theory of computation class.

 

What are formal languages?

 

Formal languages are distinct from both programming languages and natural languages - the type of language that you and I speak.  In the theory of computation, they are normally referred to simply as languages, since it is usually clear what we mean when we mention them.  A language is a set of strings, each composed of symbols from an alphabet, and is described by a set of rules that govern the structure of those strings.  A language has an associated automaton, which can be used equivalently to express the set of strings contained in the language.  Strings in the set of a language are said to be accepted by its automaton, while strings in the language's complement (those not included in the language) are said to be rejected.

To illustrate what all of this means, let's start with an example.

I have an alphabet which contains the letters a and b.  With these two symbols, as they're called, I can construct an infinite number of strings: a, b, aa, ab, ba, bb, aaa, aab, and so on.  But I don't want to build just any string out of these symbols, I want to be able to build a string that has a certain characteristic.  I want only those strings that do not contain repeated symbols - no appearance of the substring aa or bb.  If I wanted to, I could simply list all of these strings forever until the end of time: the empty string, a, b, ab, ba, aba, bab, abab, baba...

But this is really inconvenient.  I would much rather define a set of rules that could be used to generate these strings, so that I can capture the essence of my language with a finite amount of effort.  I can explain my language in plain English, as I have done already, or I can use a more formal mathematical notation which is universally recognized by all computer scientists.  In this case, my language is simple enough to be described using a regular expression.  My language is in the set of regular languages.

 

Regular Languages

 

There are different classifications of languages.  Regular languages are the simplest kind, recognizable using only a fixed, finite amount of memory.  For our language in particular, I only need to look one symbol to the left to know if my string is violating the rules.  To verify a string in the language we were just dealing with, the idea is simple: look at each symbol starting at the first character and ending at the last character.  For each, make sure that the previous character is different from the current one.  If we reach the end without any problems, accept the string.  Otherwise, reject it.

Describing this process using a regular expression ensures that there is no confusion among other computer scientists about which strings my language contains.  In formal notation, the regular expression for this language is the following:

$$(ab)^*a \: | \: (ba)^*b \: | \: (ab)^* \: | \: (ba)^*$$

There are several key mathematical symbols in this expression, each of which specifies some sort of operation.  Parentheses are used just as we would use them in arithmetic - they modify the order of operations by grouping symbols and other operations together, recursively.  Operations within a set of parentheses must be performed first.  Then, we have what's called the Kleene star, represented as an asterisk.  Symbols operated on by a Kleene star can be repeated any number of times, from zero to infinity.  Finally, we have the union symbol, represented by a vertical bar.  Just like in set theory, unions in regular expressions combine two languages together to make a more general language that accepts strings in either one.

In this particular expression, there are four general possibilities.  We can have a string consisting of repeating ab substrings, ending in a.  We can have a string with repeating ba substrings, ending in b.  Or we can have repeating ab or ba substrings, with no a or b at the end, respectively.  Note that the strings a and b are in the language, because the substring in the parentheses can be repeated zero times, resulting in an empty string.  In fact, we can just have an empty string with nothing in it at all.

There are other operations besides those previously described.  A plus sign indicates the same thing as a kleene star, except there must be at least one occurrence of the substring being operated on.  A question mark indicates zero or one occurrence.  A substring raised to the nth power is repeated exactly n times.  Even more operations exist, but a theory of computation course will usually limit itself to the aforementioned selection and focus on the core concepts, rather than force you to memorize the notation for every possible operation.
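
In terms of the operations we already have, for some symbol a:

$$a^+ = aa^* \qquad a? = (a \: | \: \epsilon) \qquad a^3 = aaa$$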

Now suppose we wanted to construct some kind of machine that would recognize strings in our language.  This machine would cycle through a finite set of states, moving between them every time it encounters a symbol in the input string.  The machine would begin at some state, and end at some other state, either accepting the string or rejecting it if the language's rules have been violated.  There are several different types of machines that can perform this task.  The simplest one is called a deterministic finite automaton.

 

Deterministic Finite Automata

 

First, some etymology.  Determinism is the tendency of a system to always behave the same way given the same starting conditions and input.  This is in opposition to nondeterminism, the tendency of a system to potentially behave differently even when given the same starting conditions and input.  A deterministic finite automaton is a finite state machine that is predictable because it is never given a choice between two possibilities - when analyzing a string, it must always proceed in the same manner every time.  This concept may be a bit esoteric for the time being, but we will return to it later when discussing a different type of finite automaton.

Taking our language and constructing a DFA for it, we get the following state machine diagram:


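One DFA for this language, written as a transition table rather than a diagram (q0 is the start state, starred states are accept states, and q4 is a trap state that can never be escaped):

    state    on a    on b
    q0*      q1      q3
    q1*      q4      q2
    q2*      q1      q4
    q3*      q1      q4
    q4       q4      q4
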
What this diagram is telling us is this: First, initialize the machine with state q0.  As we encounter each new symbol in the input string, have the machine react by moving to the appropriate state.  If by the time we reach the end of the string we are in an accept state (indicated by double circles), accept the string.  Otherwise, reject the string.

Giving this DFA the valid input string abab, we would observe the state history q0 => q1 => q2 => q1 => q2.  Because this run of the automaton ended on q2 and q2 is an accept state, it would accept the string.  Giving the input string abb would result in rejection because the automaton would halt on state q4, which is not an accept state.

Deterministic finite automata are only able to recognize the set of regular languages, because they have no memory.  If we were to take this diagram and construct an actual physical machine capable of performing string recognition for this language, we would only need one register to store the current state of the machine, and nothing else.  This is of course ignoring whatever mechanism is being used to feed the string to the machine.

Bringing back the concept of determinism, it is clear that DFAs do not make a choice between a set of paths.  There is only one path the DFA can take to the next state from its current state for each symbol in the alphabet.  This is what makes them deterministic.  But the DFA has a nondeterministic cousin, the nondeterministic finite automaton, which is able to get around this limitation.

 

Nondeterministic Finite Automata

 

Nondeterministic finite automata introduce the concept of multiple concurrent pathways to finite state machines.  While DFAs are forced to be complete and can have only one transition per symbol in the alphabet, NFAs are not bound by these restrictions.  NFAs can be incomplete, meaning that they do not have to define a transition away from a state for every symbol in the alphabet, while DFAs are required to (in most contexts - some theory of computation courses opt not to put this restriction on DFAs).  They can also define special epsilon transitions.

An NFA for our language might look like this:


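One NFA with that shape, written as a transition list (starred states are accept states, with one branch per term of the regular expression):

    q0:  ε → q1, ε → q3, ε → q5, ε → q7
    q1:  a → q2       q2*: b → q1      (matches (ab)*a)
    q3:  b → q4       q4*: a → q3      (matches (ba)*b)
    q5*: a → q6       q6:  b → q5      (matches (ab)*)
    q7*: b → q8       q8:  a → q7      (matches (ba)*)
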
Just like a DFA, an NFA like this one looks at one character in the string at a time from left to right, and moves between states on each encountered symbol.  The epsilon symbol, seen branching four times from state q0, represents the empty string.  When an NFA takes an epsilon state transition, it does not consume a symbol in the string.  Rather, it moves to the corresponding state without even looking at whatever character is next.

It is also permitted to have multiple transitions of the same symbol exiting the same state - whenever this occurs, the NFA is given multiple choices and can take either transition.  The benefit to this behavior is that only one path needs to successfully recognize the string in order for it to be accepted.  If the NFA takes a path that leads to a rejection, it can roll itself back and try some other path instead.  If no path leads to acceptance, then the string is rejected.

If an NFA encounters a symbol that does not have a transition away from its current state, the machine will halt and reject the string (or roll itself back and try another path).  In some theory of computation courses, this behavior is also granted to DFAs (the halting part).  In that case, the only features distinguishing an NFA from a DFA are the allowance of branching and epsilon transitions.

It is important to realize that there are multiple ways of building automata for the same language.  This NFA has more states than the DFA we built, but it better captures the union operations in the regular expression.  It is entirely possible to build an NFA with fewer states and transitions than this one.  In fact, NFAs usually have fewer states than DFAs due to the reduced number of restrictions put on them.

Like DFAs, NFAs recognize only regular languages.  It can be said that DFAs, NFAs, and regular expressions are equally powerful.  The more languages an automaton can recognize, the more powerful it is said to be.  There are other types of languages that require more powerful automata than the ones just discussed.

 

The Power of Languages

 

What is it about certain languages that require a more powerful automaton to recognize, exactly?  Is it that languages that require more power accept more strings?  Well, as discussed when we were first choosing a language to implement, languages are actually restrictive on input.  A weak automaton will accept many strings that a stronger one would reject, because it is less able to recognize patterns within strings and enforce rules against them.  So languages that require more power actually accept fewer strings, not more.

There is an entire branch of computational theory that deals with the complexity of systems in general.  Languages are included in this analysis, and have their own hierarchy - the Chomsky hierarchy - named after linguist and philosopher Noam Chomsky.  Below we see the structure of this hierarchy:


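In set notation:

$$\text{regular} \subset \text{context-free} \subset \text{context-sensitive} \subset \text{recursively enumerable}$$
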
We have covered the smallest circle in this diagram so far.  Notice that the higher classifications of languages encompass each other.  This means that more powerful automata are able to recognize every language that their less powerful counterparts can, and more.  An automaton that is able to recognize context-free languages is necessarily able to recognize regular languages as well.  Likewise, an automaton capable of recognizing context-sensitive languages is also capable of recognizing context-free languages, and so on.

In order to demonstrate exactly what kind of languages require these automata, let's move on to context-free grammars.

 

Context-Free Grammars

 

Regular languages are limited in a way that context-free grammars are not.  A regular expression, DFA, or NFA is incapable of expressing any sort of recursive structure.  Context-free grammars were specifically designed for this purpose.  A context-free grammar consists of a set of production rules which replace nonterminal symbols with terminal symbols or other nonterminal symbols.  Recursion can be accomplished by creating a production rule that replaces a nonterminal symbol with itself, plus other symbols.

Take, for example, a language which accepts strings that consist of the symbol a repeated some number of times, followed by b repeated that same number of times.  That is, $a^nb^n$ where n is any nonnegative integer.  This language is clearly not regular - a DFA or an NFA can't remember how many times it has seen the symbol a when it comes time to count how many times b shows up.  A context-free grammar, on the other hand, has no problem performing this task.  Take the set of production rules below:

$$S \rightarrow aSb$$

$$S \rightarrow \epsilon$$

The way these production rules work is simple - begin with the start symbol S, a nonterminal symbol.  Replace it with the right hand side in either rule until you are left with only terminal symbols (those in the alphabet).  When replacing with epsilon, just remove the nonterminal symbol entirely.  We can apply these rules to generate the string aaabbb: S => aSb => aaSbb => aaaSbbb => aaabbb.

This is great if we want to generate a string that requires this kind of recursive structure.  But we are still in need of an appropriate recognizer for these languages.  DFAs and NFAs won't do, so we need some kind of new state machine that is capable of performing recursion.  This is where pushdown automata come in.

 

Pushdown Automata

 

A pushdown automaton is essentially the same thing as an NFA, except it also has memory in the form of a stack.  Stacks make recursion possible - whenever you write a recursive function in a programming language, you are stacking function calls on top of each other that eventually unwind back down to the bottom.  In the case of a pushdown automaton, symbols are placed on and removed from the stack which can be used to influence the behavior of the state machine.

Here is one possible configuration of a pushdown automaton for our language:


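In transition-list form (q1 and q4 are the accept states):

    q0 → q1:   ε; ε → $     (push a marker for the bottom of the stack)
    q1 → q2:   a; ε → λ     (push a λ for the first a)
    q2 → q2:   a; ε → λ     (push a λ for each further a)
    q2 → q3:   b; λ → ε     (pop a λ for the first b)
    q3 → q3:   b; λ → ε     (pop a λ for each further b)
    q3 → q4:   ε; $ → ε     (pop the marker once every λ is gone)
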
This looks a lot like an NFA, but we have a little bit of extra notation going on.  Each transition does more than just look at a character - it also does one or two stack operations (or none).  The symbol to the left of the semicolon is the expected character, which is consumed in the same manner as in an NFA.  Immediately to the right of the semicolon and to the left of the arrow, we identify what character we expect to pop off the stack.  To the right of the arrow, we identify the character we want to push onto the stack.  An epsilon placed to the left or right of the arrow indicates that nothing is popped off of or pushed onto the stack, respectively.

A PDA has two alphabets: the alphabet of the examined string that it is accepting or rejecting, and the alphabet of the stack.  In this case, the string alphabet is the same as the regular language we were dealing with - {a, b}.  The stack alphabet contains two characters: a dollar symbol to indicate the bottom of the stack, and a lambda, which is pushed on the stack for every encountered a and popped off the stack for every encountered b.

We read the diagram in this fashion: Begin at state q0.  Push a dollar symbol onto the stack to indicate its bottom and move to state q1.  If the string is empty, we're done - just accept it.  Otherwise, if we encounter an a, push a lambda onto the stack and move to state q2.  Keep doing this and remain at the same state bubble until we encounter a b.  In that case, pop a lambda off the stack and move to state q3.  Keep doing this for every b we encounter.  Finally, if we reach the dollar symbol at the bottom of the stack, move to state q4.  If there are no more characters in the string, accept the string.  If at any point an unexpected character is encountered (in either the string or on the stack) the PDA will reject the string, just like an NFA.

All transitions are atomic - they can only take place if all of the conditions are met.  For example, for the transition between q3 and itself, if we encounter a b in the string but we pop a dollar sign off the stack instead of a lambda, then we know that there are more instances of b in the string than a.  Such a string must be rejected.

A standard PDA alone is not capable of detecting all recursive structures, but it will suffice in most cases.  To explore its insufficiency, let's talk about context-sensitive grammars.

 

Context-Sensitive Grammars

 

There are some languages that require a little more information about surrounding symbols to be able to be recognized or generated.  In other words, they require context.  The primary difference between these and context-free grammars is that multiple terminal or nonterminal symbols are allowed to appear on the left hand side of a production rule.  Context-free grammars are not given this freedom.

To give an example of a language that is context-sensitive, consider adding an additional requirement to the language we were dealing with before.  A new symbol, c, is added to the string alphabet, which must also appear n times at the end of the string - that is, $a^nb^nc^n$.  For example, aaabbbccc.  A pushdown automaton would not be able to recognize this language, because once every instance of lambda has been popped off the stack and we reach the first c we will have forgotten what n is.

The following production rules generate this language:

$$S \rightarrow aSBC$$

$$S \rightarrow \epsilon$$

$$CB \rightarrow CZ$$

$$CZ \rightarrow WZ$$

$$WZ \rightarrow WC$$

$$WC \rightarrow BC$$

$$aB \rightarrow ab$$

$$bB \rightarrow bb$$

$$bC \rightarrow bc$$

$$cC \rightarrow cc$$

There are quite a few things to talk about here.  The first is that production rules in a context-sensitive grammar carry an additional stipulation compared with those of an unrestricted grammar, which is one level higher in complexity.  All production rules must replace at least one nonterminal symbol with a symbol not present on the left hand side.  That is to say, the following production rule...

$$CB \rightarrow BC$$

...is not allowed, because it simply shifts the order of the nonterminals rather than introducing some new symbol.  We get around this restriction by expanding it into four rules, employing a specific tactic known as Revesz' trick.  To demonstrate how this grammar works, consider the following generation of the string aabbcc:

S
=> [aSBC]
=> a[aSBC]BC
=> aaBCBC
=> aaB[CZ]C
=> aaB[WZ]C
=> aaB[WC]C
=> aaB[BC]C
=> a[ab]BCC
=> aa[bb]CC
=> aab[bc]C
=> aabb[cc]

In each step, the symbols just produced by a production rule are bracketed.  The fourth line is the result of the S to epsilon rule, which produces nothing, so nothing is bracketed there.

The interesting thing about recognition of context-sensitive languages is that a pushdown automaton is still capable of recognizing them.  However, the pushdown automaton must have not just one stack, but two or more stacks.  Additional stacks beyond a count of two do not give pushdown automata more power, however they can reduce the number of states and transitions required.  Pushdown automata with multiple stacks are abbreviated as N-PDAs, where N is some integer between 2 and infinity.  Describing such PDAs is tedious, so instead I will demonstrate the usage of a different kind of recognizer: a Turing machine.

 

Turing Machines

 

While you may or may not be familiar with how one works, chances are you've heard of Turing machines.  They are named after the most famous computer scientist of all, Alan Turing, who aided the Allied forces in the Second World War by decrypting Nazi communications.  A Turing machine is capable of performing any algorithm thrown at it, and any set of machine instructions that is able to emulate a Turing machine is considered to be Turing complete.  Pretty much every computer you use today is some sort of emulation of a Turing machine.

A physical description of a Turing machine is required to really understand the concept.  It consists of a tape which is assumed to be infinite in length, and is divided discretely such that any number of symbols can be inscribed onto it.  The tape can be moved left or right so that the symbol at any position can be examined at any time.  Essentially, this is a form of random access memory, named as such because divisions of the memory can be accessed at random with no ordering requirements.  Stack memory is incapable of this, since we have to pop items off and forget them in order to get to something underneath.

When a string is fed into a Turing machine, it is printed on the tape from left to right.  The Turing machine will read and overwrite parts of the tape (not necessarily only the part holding the string), eventually accepting or rejecting the string it was given.  A state machine diagram for a Turing machine capable of recognizing our context-sensitive language is shown below:


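The diagram boils down to a handful of phases; here is an outline of one such machine (the state names are mine, not from the diagram):

    start:   on a blank, accept (the empty string); on an a, write x,
             move right, and go to find-b; on a y, go to verify
    find-b:  move right over a's and y's; at the first b, write y,
             move right, and go to find-c
    find-c:  move right over b's and z's; at the first c, write z,
             then go to rewind
    rewind:  move left until an x is found, then step right and
             return to start
    verify:  move right over y's and z's; on reaching a blank, accept;
             on anything else, halt and reject
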
The machine begins at the left end of the tape, where the first character in the string is located.  If the string is an empty string, then that part of the tape will be blank - indicated by the square cup symbol.  If the string is non-empty, then we should encounter an a first.  As with an NFA, a TM will halt and reject the string if it encounters anything unexpected.  Otherwise, when we encounter that a, overwrite it with an x.

Next, move past all the a characters until we reach the first b.  Replace it with a y.  Move to the first c.  Replace it with a z, and now work backwards to the beginning of the tape where we placed an x.  Move right, and if we find an a, repeat this process of replacing characters.  On later passes we will move right past y and z characters - just ignore them.  Eventually there will come a point where there is an xy substring on the tape.  This means we have consumed every a.  If this is the case, then we should have consumed every b and c by turning them into y and z.  If that's all we encounter until we reach the right end of the tape, which is blank, then the string should be accepted as it fits our pattern.

For the string aaabbbccc, we would see the following tape history:

aaabbbccc
xaabbbccc
xaaybbccc
xaaybbzcc
xxaybbzcc
xxayybzcc
xxayybzzc
xxxyybzzc
xxxyyyzzc
xxxyyyzzz

If this were the 1940s and we were still using magnetic tapes, we would know that our string was accepted based on the fact that only x, y, and z characters are left on the tape, besides the blank slots.  You could probably imagine doing the same thing with a string in some programming language - if so, you would be using a computer that is Turing complete, which is all of them.

 

Recursively Enumerable Languages and Beyond

 

By now you're probably wondering about that last bubble in the Chomsky hierarchy diagram - you know, the big one.  Well, the situation is rather interesting.  We started off trying to avoid having to list every string in our language one by one, because of how inconvenient that was.  Recursively enumerable languages are those languages whose strings we can do no better than to list one by one - a Turing machine can enumerate every string in the language, but it may run forever when checking a string that isn't in it.  So we're back to square one.

The problem here, as I've described before, is that a string can be of any length.  Therefore, it is possible for a language's set to be infinite in size - in fact, that is true of every language we have looked at so far - so an enumeration may simply never finish.  Languages that lie outside of the set of context-sensitive languages are therefore problematic and present a host of complicated issues, which are usually addressed in the second half of a theory of computation course.  That second half then becomes less about languages and more about abstract problems which lie outside the scope of this lesson.

So, that concludes this topic.  Hopefully this was able to give you a bit more confidence as you head into one of the more theoretical courses in the typical computer science program.  Good luck and thank you for reading.