Lab 5: Text Generation

Aaron Bauer

February 8, 2022

Lab 5: Text Generation1

Project Labs

This and the remaining labs in this course will be somewhat different than the labs we’ve done so far. They will each be worth 40 points, and you will have more time to do them. These labs will be moderately larger than previous labs, but the extra time is primarily to give you the opportunity to devise and plan out your approach. As such, there will also have less in terms of step-by-step procedures to follow. Developing your independent problem-solving skills is crucial to using computers to solve problems, so these labs put an emphasis on practicing these skills.

Overview

In this lab you will write a program that analyzes the letter frequencies in text documents and then generate new documents based on those frequencies. Consider the following three quotes:

Call me Ishmael. Some years ago–never mind how long precisely–having repeatedly smelt the spleen respectfully, not to say reverentially, of a bad cold in his grego pockets, and throwing grim about with his tomahawk from the bowsprit?

Call me Ishmael. Some years ago–never mind how long precisely–having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.

Call me Ishmael, said I to myself. We hast seen that the lesser man is far more prevalent than winds from the fishery.

The second quote is the first sentence of Herman Melville’s 1851 novel, Moby Dick. The other two quotes were generated in Melville’s style using a simple algorithm developed by Claude Shannon in 19482. You will implement Shannon’s algorithm in this lab.

The goals for this lab are

Preparatory Exercises

  1. Read this entire writeup, paying careful attention to Discussion and Program Design and Implementation
  2. Watch the Introductory Video
  3. Plan out how you will approach implementing TextGenerator.
    • What other classes will you create?
    • What methods will they have?
    • What data structures will you use?
  4. Download the starter project, unzip it into its own folder, and open that folder in VS Code.

Introductory Video

Panopto view link: https://carleton.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=a9c98a3c-4dc0-4b90-8c23-accb012447b9

Suggested Timeline

You can, of course, opt to use HashMaps directly in the TextGenerator class instead of implementing LetterInventory and FrequencyTable, though I expect having these separate classes will make the code easier to write, test, and debug.

Discussion

Level 0 analysis

The following algorithm is based on letter probability distributions. Imagine taking the novel Tom Sawyer and counting the frequency with which each character occurs. You’d probably find that spaces are the most common, that the character 'e' is fairly common, and that the character 'q' is rather uncommon. After completing this level 0 analysis, you’d be able to produce Tom Sawyer-esque text based on character frequencies. The idea is to randomly select each character based on the probability of its occurrence in the text. This random text wouldn’t have much in common with the real novel, but characters would tend to appear with the same proportions. In fact, here’s an example of what a level 0 analysis might produce:

rla bsht eS ststofo hhfosdsdewno oe wee h .mr ae irii ela iad o r te u t mnyto onmalysnce, ifu en c fDwn oee iteo

Level 1 analysis

Now imagine doing a slightly more sophisticated level 1 analysis by determining the frequency with which each character follows every other character. You would probably discover that 'h' is more likely to follow 't' than 'x', and you would probably discover that a space follows '.' more frequently than ',' does.

More concretely, if you are analyzing the text "the theater is their thing", 'e' appears after 'h' three times, and 'i' appears after 'h' one time; no other letters appear after 'h'. So, given this example,

Using a level 1 analysis, you could produce randomly generated Tom Sawyer-esque text by 1) picking a character, and then 2) always choosing the next character based on the probability of the next character given the chosen character. Here’s an example of level 1 random text:

“Shand tucthiney m?” le ollds mind Theybooure He, he s whit Pereg lenigabo Jodind alllld ashanthe ainofevids tre lin–p asto oun theanthadomoere

Level k analysis

Now imagine doing a level k analysis by determining the probability with which each character follows every possible sequence of characters of length k. For example, a level 5 analysis of Tom Sawyer would reveal that 'r' follows "Sawye" more frequently than any other character. After a level k analysis, you would be able to generate Tom Sawyer-esque text by always choosing the next character based on the previous k characters and the probabilities revealed by the analysis.

At only a moderate level of analysis (levels 5–7), randomly-generated text begins to take on many of the characteristics of the source text. It probably won’t make complete sense, but you’ll be able to tell that it was derived from Tom Sawyer as opposed to, say, Moby Dick. Here are some more examples:

(Level 2:) “Yess been.” for gothin, Tome oso; ing, in to weliss of an’te cle - armit. Papper a comeasione, and smomenty, fropeck hinticer, sid, a was Tom, be suck tied. He sis tred a youck to themen

(Level 4) en themself, Mr. Welshman, but him awoke, the balmy shore. I’ll give him that he couple overy because in the slated snufflindeed structure’s kind was rath. She said that the wound the door a fever eyes that WITH him.

(Level 6:) people had eaten, leaving. Come - didn’t stand it better judgment; His hands and bury it again, tramped herself! She’d never would be. He found her spite of anything the one was a prime feature sunset, and hit upon that of the forever.

(Level 8:) look-a-here - I told you before, Joe. I’ve heard a pin drop. The stillness was complete, however, this is awful crime, beyond the village was sufficient. He would be a good enough to get that night, Tom and Becky.

(Level 10:) you understanding that they don’t come around in the cave should get the word “beauteous” was over-fondled, and that together” and decided that he might as we used to do - it’s nobby fun. I’ll learn you.”

Once we have processed input text and stored it in a map structure that allows us to check frequencies, we pick k letters (for example, the first k in the input text) to use as a seed for our new text. Then we choose subsequent characters based on the preceding k characters and their frequency information. We would need to pick a starting k letters that actually appear in the input text, so we have information about what letter should come next.

Program Design and Implementation

This lab is deliberately more open-ended than the previous few labs. It’s important to practice the skill of designing your own approach to a new problem. So instead of a step-by-step procedure, this section contains general guidance I think might be useful to you.

The class that will run your program, TextGenerator, is the only one provided in the starter project. It must have the given constructor (public TextGenerator(int k, String filename)) and method (public String generate(int length)), as the grading test cases depend on those. Beyond that, you are free to use any design you choose. One possible approach is discussed below.

When thinking about the design, focus on what would constitute a good data structure for this problem. Your data structure needs to keep a table of info and be able to support two key operations:

You might try to save the frequency information in a big array, but the size will quickly become too large. For k = 2, you would need a three-dimensional array whose size is about 27,000 entries if you include blanks and punctuation. For larger k, the number of entries grows rapidly.

Instead, you could develop a FrequencyTable class implemented with a HashMap. The HashMap could store character sequences (i.e., Strings) as the keys and LetterInventorys as the values (i.e., private HashMap<String, LetterInventory> table). The LetterInventory is a class (which you would also develop) that keeps track of which characters appear after a given sequence, along with a frequency.

There are many ways to implement a LetterInventory. A good start is… another HashMap! Thus, the letter inventory’s HashMap stores key-value pairs where each key is a single character (stored as a Character) and each value is the number of times the given letter occurs after the k-character sequence with which the inventory is associated (stored as an Integer). This would have type HashMap<Character, Integer>. Think carefully about what methods the letter inventory should have.

Selecting a random key

Let’s say you have a field HashMap<String, LetterInventory> table and you want to select a random key from the map. You might want to do something like this in order to pick an initial k-length string for generate to start with (or better yet, implement a FrequencyTable method that provides this functionality). Doing this turns out to be a little cumbersome in Java. The first step is to get an array of all the keys:

Object[] keys = table.keySet().toArray();

keys is type Object[] because the toArray method returns an array of Objects. Next we can use the StdRandom class from the algs4 library to generate a random index for this array:

String randomKey = (String) keys[StdRandom.uniform(keys.length)];

Selecting the next character

As part of this frequency analysis approach, you will be faced with the situation of having characters and associated counts (in say, your LetterInventory class), and needing to select a character with probability proportional to its frequency (see Level 1 analysis for an example). Here’s an algorithm you might use to accomplish this:

  1. Find the sum of all the counts, call it total
  2. Generate a random number between 0 and 1, call it threshold (e.g., threshhold = StdRandom.uniform();)
  3. Loop over the each letter and the associated count
    1. Update threshold to threshold - count / total
    2. If threshold is less than zero, select letter

The basic idea is that we convert each count into a probability by dividing it by the sum of all counts. By subtracting each probability from our random threshold, we give each letter the appropriate chance of being selected.

For example, assume these probabilities are 0.3, 0.5, and 0.2. A random number between 0 and 1 has a 30% chance of being less than 0.3, so threshold - 0.3 has a 30% of being less than 0. It also has a 50% chance of being between 0.3 and 0.8, so threshold - 0.3 - 0.5 has a 50% chance of being less than 0. Finally, the random number has a 20% chance of being between 0.8 and 1, so there’s a 20% chance the previous two letters weren’t selected and we select the third.

Testing

The starter project includes JUnit test cases to assess the correctness of the generate method of the TextGenerator class. See Grading for how each test case contributes to your lab grade.

You should build your program in stages that you have planned out ahead of time. While writing and debugging your code, try using the small.txt input file (containing only the theater is their thing) and a k of 1. Using these fixed parameters will help you ensure the correctness of your code since you can verify the correct answers by hand on paper.

One benefit of breaking up your program into several classes (LetterInventory, FrequencyTable, and TextGenerator) as described in Program Design and Implementation is that you can test each piece independently. As you work on each class, write a main method that constructs an instance of your class and calls all its public methods, printing their return values. That was you can assess whether these methods are working as expected. I have added a warning to checkstyle about creating such a main method to help you remember.

Style

You are expected to submit files that contains no checkstyle errors or warnings. You will lose 0.5 points per error (up to a maximum of 3) and per warning (up to a maximum of 2), as indicated in the Grading breakdown.

checkstyle Warnings

For this and all following labs you are expected to address ALL checkstyle problems, including warnings.

Challenges

Consider attempting any or all of these additional challenges if you have time:

Challenges

Challenges are a way to go above and beyond what is expected for a lab. Only attempt them after you are completely finished with all other parts of the lab. Completing all the non-challenge components of a lab will earn 38/40 (95%, an A). Earning challenge points can result in a grade higher than 100%.

Grading

This lab will be graded out of 40 points as shown below. While most of the points for this lab are associated with specific test cases, partial credit can be earned for test cases that don’t pass. It it possible to earn a passing graded even if your submission does not pass any tests. Partial credit will be awarded based on evidence of a good-faith effort to implement the related features. Comments explaining your approach can help earn partial credit.

Requirement Points
isRandom passes 6 points
usesFrequency passes 6 points
level1 passes 5 points
level2 passes 4 points
level4 passes 4 points
level8 passes 4 points
Submitted files do not have any checkstyle errors 3 points
Submitted files do not have any checkstyle warnings 2 points
Check-in post 2 points
TextGenerator.java compiles without warnings 1 point
Correct submission format (TextGenerator.java and any necessary supporting files) 1 point
Challenges up to 4 points

  1. Adapted from Bill Lenhart and Bill Jannen: https://williams-cs.github.io/cs136-f20-www/labs/wordgen.html↩︎

  2. Claude Shannon, “A mathematical theory of communication”, Bell System Technical Journal, 1948. Shannon’s technique was popularized by A.K. Dewdney’s A potpourri of programmed prose and prosody that appeared in Scientific American, 122-TK, 1989.↩︎