This assignment is intended to give you experience with implementing the Dictionary (Table) ADT with both tries and hash tables, and comparing their performance.
The laws of probability indicate that if you left some monkeys in a room with some typewriters for long enough, eventually they would reproduce Shakespeare. This assignment asks you to write a program that tries to speed up the process. Instead of producing an output text in which each character is chosen independently and totally at random, the program tries to "learn" from an input text about what kinds of sequences of characters are likely to occur in English.
First, suppose you want to implement a simple version of this program that just mimics the distribution of the 26 letters of the alphabet and the space character in the input text.
Scan through the input text, keeping track of how many times each character occurs. During this process, you will update a CharDistribution object using a method called "occurs" that will increment a counter for character c every time c occurs.
After the distribution has been initialized in this way, you can obtain characters from the CharDistribution object using a method "getRandomChar". This method will return a character chosen randomly with the appropriate frequency. For example, if the input text has 100 characters and the letter "a" occurred 20 times, then the probability that the function returns "a" would be .2 (20%).
To implement getRandomChar, choose a number at random between 1 and the size of the input text, such that each number has equal probability of being chosen. Then scan through the array of counters, keeping a cumulative sum of character occurrences. When the sum exceeds the random number, return the associated character.
To generate the output text for the simple case, call getRandomChar the required number of times, after initializing the CharDistribution object on the input text.
The Real Case
The simple case described above generates a series of characters in which each individual character occurs with the same frequency as it does in the input text. However, the combinations of characters will not reflect the input text (i.e., will not be plausible English). To capture the frequency with which different character combinations occur, we need to alter this approach.
When scanning the input text, you must create several different CharDistribution objects. The number of CharDistribution objects depends on the window size w provided as input to your program. First, suppose w = 1. Then we will need 27 different character distributions: the first one records the frequency with which each of the 27 different characters follows "a" in the input text; the second one records the frequency with which each of the 27 different characters follows "b" in the input text; etc.
In general, when the window size is w, potentially 27^w different character distributions are needed: one that records the frequency with which different characters follow "aaa...a"; one that records the frequency with which different characters follow "aaa...b"; etc.
Although 27^w is huge even for relatively small values of w, almost all of these distributions are not needed! For instance, it is extremely unlikely that "aaa...a" will occur in the input text, and if it does not occur, then the distribution of characters following "aaa...a" is not needed.
We will keep the different distribution objects in a Dictionary (Table) data structure, where the key for each distribution object is the window that it corresponds to.
After scanning the input text, creating the different character distribution objects and storing them in a dictionary, the output text is generated as follows:
the first w characters of the output are the first w characters of the input text repeat let the window be the last w characters that have been generated use the window as the key to search the dictionary and obtain a character distribution call the getRandomChar method for that distribution to get the next character to be generated until desired lengthNote: To prevent difficulties due to a distribution of all zeroes, the last w characters of the input text must also occur earlier in the text.
You will have to experiment with the table size. It should be chosen such that the table is roughly 75% full.
Updated on April 7, 2004.
These requirements will affect your implementation, because you will have to include code to do these measurements.
Due by 10:20 AM on Monday, April 19, 2004:
Postponed to 10:20 AM on Wednesday, April 21, 2004
Postponed again to 10:20 AM on Friday, April 23, 2004