February 8, 2022
TextGenerator.java
and
any supporting files you created to Lab
5. If you attempt any Challenges upload
the files for those and a text file named challenges.txt
describing what you did.This and the remaining labs in this course will be somewhat different than the labs we’ve done so far. They will each be worth 40 points, and you will have more time to do them. These labs will be moderately larger than previous labs, but the extra time is primarily to give you the opportunity to devise and plan out your approach. As such, there will also have less in terms of step-by-step procedures to follow. Developing your independent problem-solving skills is crucial to using computers to solve problems, so these labs put an emphasis on practicing these skills.
In this lab you will write a program that analyzes the letter frequencies in text documents and then generate new documents based on those frequencies. Consider the following three quotes:
Call me Ishmael. Some years ago–never mind how long precisely–having repeatedly smelt the spleen respectfully, not to say reverentially, of a bad cold in his grego pockets, and throwing grim about with his tomahawk from the bowsprit?
Call me Ishmael. Some years ago–never mind how long precisely–having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.
Call me Ishmael, said I to myself. We hast seen that the lesser man is far more prevalent than winds from the fishery.
The second quote is the first sentence of Herman Melville’s 1851 novel, Moby Dick. The other two quotes were generated in Melville’s style using a simple algorithm developed by Claude Shannon in 19482. You will implement Shannon’s algorithm in this lab.
The goals for this lab are
TextGenerator
.
Panopto view link: https://carleton.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=a9c98a3c-4dc0-4b90-8c23-accb012447b9
LetterInventory
class by Saturday,
2/12FrequencyTable
class by Monday,
2/14TextGenerator
constructor that
builds up a frequency table by Wednesday, 2/16generate
method by Friday,
2/18You can, of course, opt to use HashMap
s directly in the
TextGenerator
class instead of implementing
LetterInventory
and FrequencyTable
, though I
expect having these separate classes will make the code easier to write,
test, and debug.
The following algorithm is based on letter probability distributions.
Imagine taking the novel Tom Sawyer and counting the frequency
with which each character occurs. You’d probably find that spaces are
the most common, that the character 'e'
is fairly common,
and that the character 'q'
is rather uncommon. After
completing this level 0 analysis, you’d be able to produce
Tom Sawyer-esque text based on character frequencies. The idea
is to randomly select each character based on the probability of its
occurrence in the text. This random text wouldn’t have much in common
with the real novel, but characters would tend to appear with the same
proportions. In fact, here’s an example of what a level 0
analysis might produce:
rla bsht eS ststofo hhfosdsdewno oe wee h .mr ae irii ela iad o r te u t mnyto onmalysnce, ifu en c fDwn oee iteo
Now imagine doing a slightly more sophisticated level 1 analysis by
determining the frequency with which each character follows every
other character. You would probably discover that 'h'
is more likely to follow 't'
than 'x'
, and you
would probably discover that a space follows '.'
more
frequently than ','
does.
More concretely, if you are analyzing the text
"the theater is their thing"
, 'e'
appears
after 'h'
three times, and 'i'
appears after
'h'
one time; no other letters appear after
'h'
. So, given this example,
'e'
follows
'h'
is 0.75
,'i'
follows
'h'
is 0.25
,'h'
is 0
.Using a level 1 analysis, you could produce randomly generated Tom Sawyer-esque text by 1) picking a character, and then 2) always choosing the next character based on the probability of the next character given the chosen character. Here’s an example of level 1 random text:
“Shand tucthiney m?” le ollds mind Theybooure He, he s whit Pereg lenigabo Jodind alllld ashanthe ainofevids tre lin–p asto oun theanthadomoere
Now imagine doing a level k analysis by determining the
probability with which each character follows every possible
sequence of characters of length k. For example, a level 5
analysis of Tom Sawyer would reveal that 'r'
follows "Sawye"
more frequently than any other character.
After a level k analysis, you would be able to generate Tom
Sawyer-esque text by always choosing the next character based on
the previous k characters and the probabilities revealed by the
analysis.
At only a moderate level of analysis (levels 5–7), randomly-generated text begins to take on many of the characteristics of the source text. It probably won’t make complete sense, but you’ll be able to tell that it was derived from Tom Sawyer as opposed to, say, Moby Dick. Here are some more examples:
(Level 2:) “Yess been.” for gothin, Tome oso; ing, in to weliss of an’te cle - armit. Papper a comeasione, and smomenty, fropeck hinticer, sid, a was Tom, be suck tied. He sis tred a youck to themen
(Level 4) en themself, Mr. Welshman, but him awoke, the balmy shore. I’ll give him that he couple overy because in the slated snufflindeed structure’s kind was rath. She said that the wound the door a fever eyes that WITH him.
(Level 6:) people had eaten, leaving. Come - didn’t stand it better judgment; His hands and bury it again, tramped herself! She’d never would be. He found her spite of anything the one was a prime feature sunset, and hit upon that of the forever.
(Level 8:) look-a-here - I told you before, Joe. I’ve heard a pin drop. The stillness was complete, however, this is awful crime, beyond the village was sufficient. He would be a good enough to get that night, Tom and Becky.
(Level 10:) you understanding that they don’t come around in the cave should get the word “beauteous” was over-fondled, and that together” and decided that he might as we used to do - it’s nobby fun. I’ll learn you.”
Once we have processed input text and stored it in a map structure that allows us to check frequencies, we pick k letters (for example, the first k in the input text) to use as a seed for our new text. Then we choose subsequent characters based on the preceding k characters and their frequency information. We would need to pick a starting k letters that actually appear in the input text, so we have information about what letter should come next.
This lab is deliberately more open-ended than the previous few labs. It’s important to practice the skill of designing your own approach to a new problem. So instead of a step-by-step procedure, this section contains general guidance I think might be useful to you.
The class that will run your program, TextGenerator
, is
the only one provided in the starter project. It must have the given
constructor (public TextGenerator(int k, String filename)
)
and method (public String generate(int length)
), as the
grading test cases depend on those. Beyond that, you are free to use any
design you choose. One possible approach is discussed below.
When thinking about the design, focus on what would constitute a good data structure for this problem. Your data structure needs to keep a table of info and be able to support two key operations:
You might try to save the frequency information in a big array, but the size will quickly become too large. For k = 2, you would need a three-dimensional array whose size is about 27,000 entries if you include blanks and punctuation. For larger k, the number of entries grows rapidly.
Instead, you could develop a FrequencyTable
class
implemented with a HashMap
. The HashMap
could
store character sequences (i.e., String
s) as the
keys and LetterInventory
s as the values
(i.e.,
private HashMap<String, LetterInventory> table
). The
LetterInventory
is a class (which you would also develop)
that keeps track of which characters appear after a given sequence,
along with a frequency.
There are many ways to implement a LetterInventory
. A
good start is… another HashMap
! Thus, the letter
inventory’s HashMap
stores key-value pairs where each
key is a single character (stored as a Character
)
and each value is the number of times the given letter occurs
after the k-character sequence with which the inventory is
associated (stored as an Integer
). This would have type
HashMap<Character, Integer>
. Think carefully about
what methods the letter inventory should have.
Let’s say you have a field
HashMap<String, LetterInventory> table
and you want
to select a random key from the map. You might want to do something like
this in order to pick an initial k-length string for
generate
to start with (or better yet, implement a
FrequencyTable
method that provides this functionality).
Doing this turns out to be a little cumbersome in Java. The first step
is to get an array of all the keys:
Object[] keys = table.keySet().toArray();
keys
is type Object[]
because the
toArray
method returns an array of Object
s.
Next we can use the StdRandom
class from the algs4 library
to generate a random index for this array:
String randomKey = (String) keys[StdRandom.uniform(keys.length)];
As part of this frequency analysis approach, you will be faced with
the situation of having characters and associated counts (in say, your
LetterInventory
class), and needing to select a character
with probability proportional to its frequency (see Level 1 analysis for an example). Here’s an
algorithm you might use to accomplish this:
total
threshold
(e.g.,
threshhold = StdRandom.uniform();
)letter
and the associated
count
threshold
to
threshold - count / total
threshold
is less than zero, select
letter
The basic idea is that we convert each count into a probability by
dividing it by the sum of all counts. By subtracting each probability
from our random threshold
, we give each letter the
appropriate chance of being selected.
For example, assume these probabilities are 0.3, 0.5, and 0.2. A
random number between 0 and 1 has a 30% chance of being less than 0.3,
so threshold - 0.3
has a 30% of being less than 0. It also
has a 50% chance of being between 0.3 and 0.8, so
threshold - 0.3 - 0.5
has a 50% chance of being less than
0. Finally, the random number has a 20% chance of being between 0.8 and
1, so there’s a 20% chance the previous two letters weren’t selected and
we select the third.
The starter project includes JUnit test cases to assess the
correctness of the generate
method of the
TextGenerator
class. See Grading for
how each test case contributes to your lab grade.
You should build your program in stages that you have planned out
ahead of time. While writing and debugging your code, try using the
small.txt
input file (containing only
the theater is their thing
) and a k of 1. Using
these fixed parameters will help you ensure the correctness of your code
since you can verify the correct answers by hand on paper.
One benefit of breaking up your program into several classes
(LetterInventory
, FrequencyTable
, and
TextGenerator
) as described in Program Design and
Implementation is that you can test each piece independently. As you
work on each class, write a main
method that constructs an
instance of your class and calls all its public
methods,
printing their return values. That was you can assess whether these
methods are working as expected. I have added a warning to
checkstyle
about creating such a main
method
to help you remember.
You are expected to submit files that contains no
checkstyle
errors or warnings. You will lose 0.5
points per error (up to a maximum of 3) and per warning (up to a maximum
of 2), as indicated in the Grading breakdown.
checkstyle
Warnings
For this and all following labs you are expected to address
ALL checkstyle
problems, including
warnings.
Consider attempting any or all of these additional challenges if you have time:
StopwatchCPU
class from algs4, measure the running time of generating 1000 characters
of text for different input files and values of k (time the
whole process, including reading the input file and building up the
table of frequencies). Based on these results, what is the big-O running
time in terms of n (number of characters in the input file) and
k? Submit a text file or a spreadsheet with your results and
interpretation.Challenges are a way to go above and beyond what is expected for a lab. Only attempt them after you are completely finished with all other parts of the lab. Completing all the non-challenge components of a lab will earn 38/40 (95%, an A). Earning challenge points can result in a grade higher than 100%.
This lab will be graded out of 40 points as shown below. While most of the points for this lab are associated with specific test cases, partial credit can be earned for test cases that don’t pass. It it possible to earn a passing graded even if your submission does not pass any tests. Partial credit will be awarded based on evidence of a good-faith effort to implement the related features. Comments explaining your approach can help earn partial credit.
Requirement | Points |
---|---|
isRandom passes |
6 points |
usesFrequency passes |
6 points |
level1 passes |
5 points |
level2 passes |
4 points |
level4 passes |
4 points |
level8 passes |
4 points |
Submitted files do not have any checkstyle errors |
3 points |
Submitted files do not have any checkstyle
warnings |
2 points |
Check-in post | 2 points |
TextGenerator.java compiles without warnings |
1 point |
Correct submission format (TextGenerator.java and any
necessary supporting files) |
1 point |
Challenges | up to 4 points |
Adapted from Bill Lenhart and Bill Jannen: https://williams-cs.github.io/cs136-f20-www/labs/wordgen.html↩︎
Claude Shannon, “A mathematical theory of communication”, Bell System Technical Journal, 1948. Shannon’s technique was popularized by A.K. Dewdney’s A potpourri of programmed prose and prosody that appeared in Scientific American, 122-TK, 1989.↩︎