February 9, 2021
TextGenerator.java
and any supporting files you created to Lab 5. If you attempt any Challenges upload the files for those and a text file named challenges.txt
describing what you did.This and the remaining labs in this course will be somewhat different than the labs we’ve done so far. They will each be worth 40 points, and you will have a week and a half to do them. These labs will be moderately larger than previous labs, but the extra time is primarily to give you the opportunity to devise and plan out your approach. As such, there will also have less in terms of step-by-step procedures to follow. Devoloping your independent problem-solving skills is crucial to using computers to solve problems, so these labs put an emphasis on practicing these skills.
In this lab you will write a program that analyzes the letter frequencies in text documents and then generate new documents based on those frequencies. Consider the following three quotes:
Call me Ishmael. Some years ago–never mind how long precisely–having repeatedly smelt the spleen respectfully, not to say reverentially, of a bad cold in his grego pockets, and throwing grim about with his tomahawk from the bowsprit?
Call me Ishmael. Some years ago–never mind how long precisely–having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.
Call me Ishmael, said I to myself. We hast seen that the lesser man is far more prevalent than winds from the fishery.
The second quote is the first sentence of Herman Melville’s 1851 novel, Moby Dick. The other two quotes were generated in Melville’s style using a simple algorithm developed by Claude Shannon in 19482. You will implement Shannon’s algorithm in this lab.
The goals for this lab are
TextGenerator
.
Panopto view link: https://carleton.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=a9c98a3c-4dc0-4b90-8c23-accb012447b9
LetterInventory
class by Saturday, 2/13FrequencyTable
class by Monday, 2/15TextGenerator
constructor that builds up a frequency table by Wednesday, 2/17generate
method by Friday, 2/19You can, of course, opt to use HashMap
s directly in the TextGenerator
class instead of implementing LetterInventory
and FrequencyTable
, though I expect having these separate classes will make the code easier to write, test, and debug.
The following algorithm is based on letter probability distributions. Imagine taking the novel Tom Sawyer and counting the frequency with which each character occurs. You’d probably find that spaces are the most common, that the character 'e'
is fairly common, and that the character 'q'
is rather uncommon. After completing this level 0 analysis, you’d be able to produce Tom Sawyer-esque text based on character frequencies. The idea is to randomly select each character based on the probability of its occurrence in the text. This random text wouldn’t have much in common with the real novel, but characters would tend to appear with the same proportions. In fact, here’s an example of what a level 0 analysis might produce:
rla bsht eS ststofo hhfosdsdewno oe wee h .mr ae irii ela iad o r te u t mnyto onmalysnce, ifu en c fDwn oee iteo
Now imagine doing a slightly more sophisticated level 1 analysis by determining the frequency with which each character follows every other character. You would probably discover that 'h'
is more likely to follow 't'
than 'x'
, and you would probably discover that a space follows '.'
more frequently than ','
does.
More concretely, if you are analyzing the text "the theater is their thing"
, 'e'
appears after 'h'
three times, and 'i'
appears after 'h'
one time; no other letters appear after 'h'
. So, given this example,
'e'
follows 'h'
is 0.75
,'i'
follows 'h'
is 0.25
,'h'
is 0
.Using a level 1 analysis, you could produce randomly generated Tom Sawyer-esque text by 1) picking a character, and then 2) always choosing the next character based on the probability of the next character given the chosen character. Here’s an example of level 1 random text:
“Shand tucthiney m?” le ollds mind Theybooure He, he s whit Pereg lenigabo Jodind alllld ashanthe ainofevids tre lin–p asto oun theanthadomoere
Now imagine doing a level k analysis by determining the probability with which each character follows every possible sequence of characters of length k. For example, a level 5 analysis of Tom Sawyer would reveal that 'r'
follows "Sawye"
more frequently than any other character. After a level k analysis, you would be able to generate Tom Sawyer-esque text by always choosing the next character based on the previous k characters and the probabilities revealed by the analysis.
At only a moderate level of analysis (levels 5–7), randomly-generated text begins to take on many of the characteristics of the source text. It probably won’t make complete sense, but you’ll be able to tell that it was derived from Tom Sawyer as opposed to, say, Moby Dick. Here are some more examples:
(Level 2:) “Yess been.” for gothin, Tome oso; ing, in to weliss of an’te cle - armit. Papper a comeasione, and smomenty, fropeck hinticer, sid, a was Tom, be suck tied. He sis tred a youck to themen
(Level 4) en themself, Mr. Welshman, but him awoke, the balmy shore. I’ll give him that he couple overy because in the slated snufflindeed structure’s kind was rath. She said that the wound the door a fever eyes that WITH him.
(Level 6:) people had eaten, leaving. Come - didn’t stand it better judgment; His hands and bury it again, tramped herself! She’d never would be. He found her spite of anything the one was a prime feature sunset, and hit upon that of the forever.
(Level 8:) look-a-here - I told you before, Joe. I’ve heard a pin drop. The stillness was complete, however, this is awful crime, beyond the village was sufficient. He would be a good enough to get that night, Tom and Becky.
(Level 10:) you understanding that they don’t come around in the cave should get the word “beauteous” was over-fondled, and that together" and decided that he might as we used to do - it’s nobby fun. I’ll learn you."
Once we have processed input text and stored it in a map structure that allows us to check frequencies, we pick k letters (for example, the first k in the input text) to use as a seed for our new text. Then we choose subsequent characters based on the preceding k characters and their frequency information. We would need to pick a starting k letters that actually appear in the input text, so we have information about what letter should come next.
This lab is deliberately more open-ended than the previous few labs. It’s important to practice the skill of designing your own approach to a new problem. So instead of a step-by-step procedure, this section contains general guidance I think might be useful to you.
The class that will run your program, TextGenerator
, is the only one provided in the starter project. It must have the given constructor (public TextGenerator(int k, String filename)
) and method (public String generate(int length)
), as the grading test cases depend on those. Beyond that, you are free to use any design you choose. One possible approach is discussed below.
When thinking about the design, focus on what would constitute a good data structure for this problem. Your data structure needs to keep a table of info and be able to support two key operations:
You might try to save the frequency information in a big array, but the size will quickly become too large. For k = 2, you would need a three-dimensional array whose size is about 27,000 entries if you include blanks and punctuation. For larger k, the number of entries grows rapidly.
Instead, you could develop a FrequencyTable
class implemented with a HashMap
. The HashMap
could store character sequences (i.e., String
s) as the keys and LetterInventory
s as the values. The LetterInventory
is a class (which you would also develop) that keeps track of which characters appear after a given sequence, along with a frequency.
There are many ways to implement a LetterInventory
. A good start is… another HashMap
! Thus, the letter inventory’s HashMap
stores key-value pairs where each key is a single character (stored as a Character
) and each value is the number of times the given letter occurs after the k-character sequence with which the inventory is associated (stored as an Integer
). Think carefully about what methods the letter inventory should have.
Let’s say you have a field HashMap<String, LetterInventory> table
and you want to select a random key from the map. You might want to do something like this in order to pick an initial k-length string for generate
to start with (or better yet, implement a FrequencyTable
method that provides this functionality). Doing this turns out to be a little cumbersome in Java. The first step is to get an array of all the keys:
Object[] keys = table.keySet().toArray();
keys
is type Object[]
because the toArray
method returns an array of Object
s. Next we can use the StdRandom
class from the algs4 libray to generate a random index for this array:
String randomKey = (String) keys[StdRandom.uniform(keys.length)];
As part of this frequency analysis approach, you will be faced with the situation of having characters and associated counts (in say, your LetterInventory
class), and needing to select a character with probability proportional to its frequency (see Level 1 analysis for an example). Here’s an algorithm you might use to accomplish this:
total
threshold
(e.g., threshhold = StdRandom.uniform();
)letter
and the associated count
threshold
to threshold - count / total
threshold
is less than zero, select letter
The basic idea is that we convert each count into a probability by dividing it by the sum of all counts. By subtracting each probability from our random threshold
, we give each letter the appropriate chance of being selected.
For example, assume these probabilities are 0.3, 0.5, and 0.2. A random number between 0 and 1 has a 30% chance of being less than 0.3, so threshold - 0.3
has a 30% of being less than 0. It also has a 50% chance of being between 0.3 and 0.8, so threshold - 0.3 - 0.5
has a 50% chance of being less than 0. Finally, the random number has a 20% chance of being betwen 0.8 and 1, so there’s a 20% chance the previous two letters weren’t selected and we select the third.
The starter project includes Junit test cases to assess the correctness of the generate
method of the TextGenerator
class. See Grading for how each test case contributes to your lab grade.
You should build your program in stages that you have planned out ahead of time. While writing and debugging your code, try using the small.txt
input file (containing only the theater is their thing
) and a k of 1. Using these fixed parameters will help you ensure the correctness of your code since you can verify the correct answers by hand on paper.
One benefit of breaking up your program into several classes (LetterInventory
, FrequencyTable
, and TextGenerator
) as described in Program Design and Implementation is that you can test each piece independently. As you work on each class, write a main
method that constructs an instance of your class and calls all its public
methods, printing their return values. That was you can assess whether these methods are working as expected. I have added a warning to checkstyle
about creating such a main
method to help you remember.
You are expected to submit files that contains no checkstyle
errors or warnings. You will lose 0.5 points per error (up to a maximum of 3) and per warning (up to a maximum of 2), as indicated in the Grading breakdown.
checkstyle
Warnings
For this and all following labs you are expected to address ALL checkstyle
problems, including warnings.
Consider attempting any or all of these additional challenges if you have time:
StopwatchCPU
class from algs4, measure the running time of generating 1000 characters of text for different input files and values of k. Based on these results, what is the big-O running time in terms of n (number of characters in the input file) and k? Submit a text file or a spreadsheet with your results and interpretation.Challenges are a way to go above and beyond what is expected for a lab. Only attempt them after you are completely finished with all other parts of the lab. Completing all the non-challenge components of a lab will earn 38/40 (95%, an A). Earning challenge points can result in a grade higher than 100%.
This lab will be graded out of 40 points as shown below. While most of the points for this lab are associated with specific test cases, partial credit can be earned for test cases that don’t pass. It it possible to earn a passing graded even if your submission does not pass any tests. Partial credit will be awarded based on evidence of a good-faith effort to implement the related features. Comments explaining your approach can help earn partial credit.
Requirement | Points |
---|---|
isRandom passes |
6 points |
usesFrequency passes |
6 points |
level1 passes |
5 points |
level2 passes |
4 points |
level4 passes |
4 points |
level8 passes |
4 points |
Submitted files do not have any checkstyle errors |
3 points |
Sbumitted files do not have any checkstyle warnings |
2 points |
Check-in post | 2 points |
TextGenerator.java compiles without warnings |
1 points |
Correct submission format (TextGenerator.java and any necessary supporting files) |
1 points |
Challenges | up to 4 points |
Adapted from Bill Lenhart and Bill Jannen: https://williams-cs.github.io/cs136-f20-www/labs/wordgen.html↩︎
Claude Shannon, “A mathematical theory of communication”, Bell System Technical Journal, 1948. Shannon’s technique was popularized by A.K. Dewdney’s A potpourri of programmed prose and prosody that appeared in Scientific American, 122-TK, 1989.↩︎