**CS 111 w20 --- Lab 3: Gerrymandering (Pair Programming)**

*Due: Monday, February 3, at 9pm*

![](lab3_images/gerrymandering.png width="70%")

(##) Purpose

For this lab, you'll be writing a Python program to analyze a data set. Python is frequently the programming language of choice for performing data analysis in the sciences and other fields, so the skills you practice in the lab are some of the most broadly applicable in CS 111. You'll work with data collected by the [MIT Election Data and Science Lab (MEDSL)](https://electionlab.mit.edu/) about elections to the US House of Representatives in 2018.

The central question you will analyze is the issue of [gerrymandering](https://en.wikipedia.org/wiki/Gerrymandering). This is the practice of drawing electoral boundaries to benefit one political party over another. This short video gives an excellent explanation: https://www.youtube.com/watch?v=Mky11UJb9AY. For a real-world example, see FiveThirtyEight's explanation of [measuring a partisan gerrymander](https://projects.fivethirtyeight.com/partisan-gerrymandering-north-carolina/). You'll be performing a similar analysis, comparing each party's share of the vote received to its share of the seats won in each state in 2018. Numerous lawsuits have been filed across the country in recent years challenging current maps as unconstitutional, and those court battles may have a big impact on the next round of redistricting after the 2020 census.

The purpose of this lab is to

- practice using nested loops and conditionals to implement a more complex algorithm
- practice manipulating lists and strings
- practice writing code to read from and write to files (i.e., persistent storage)
- see how computing can help illuminate a current social and political issue
- learn a little about US politics---lots of information linked from this writeup!
(##) Logistics

Download the following data file to start the lab:

- [`district_overall_2018.csv`](district_overall_2018.csv): results for elections to the [US House of Representatives](https://en.wikipedia.org/wiki/United_States_House_of_Representatives), which along with the US Senate forms the US Congress.

The sections below lay out the steps for your analysis of gerrymandering. Your code will read in the data file and write output to a file called `gerrymandering-report.txt`. **Your code should not read in any other data, nor declare data besides what is read in from the CSV file** (e.g., you should not declare a list of states like `states = ["Alabama", "Alaska", ...]`---any data like this must be computed from the input file).

Start early and post questions to the [Moodle forum](https://moodle.carleton.edu/mod/forum/view.php?id=507509)!

**Suggested timeline**:

- complete [Dealing With Data](#dealingwithdata) and [Accessor Functions](#accessorfunctions) by Wednesday
- complete [Useful Lists](#usefullists) and implementing the [Analysis Algorithm](#analysisalgorithm) by Friday
- use the remaining time to handle producing the [Output File](#outputfile) and [interpreting the results](#what'supwithgerry?)

Remember to consult the [Testing section](#testing) for advice on how to test your solution.

---

# Dealing With Data

The first step in any data analysis is to load in your data. In this case, we need to open and read in `district_overall_2018.csv`, which has information about elections to the House.

## Reading the File

Start a new Python script called `gerrymandering.py`. In it, use the built-in function `open` to open the data file like this:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Python
data_file = open("district_overall_2018.csv")
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For this to work, `district_overall_2018.csv` must be in the same directory as `gerrymandering.py`. Now that we've opened the file, we can start reading it in.
First, though, we need to actually look at the file to see how the data is arranged. Open `district_overall_2018.csv` (you can do this right in VS Code) and take a look around. You may find it useful to refer to the [documentation from MEDSL](https://github.com/MEDSL/2018-elections-official/blob/master/codebook-2018-election-data.md) about the variables (i.e., columns) in the data. An election is uniquely identified by the combination of `state` and `district` (e.g., the first few lines of the file correspond to the [Alabama District 1 election](https://ballotpedia.org/Alabama%27s_1st_Congressional_District_election,_2018), followed by the [Alabama District 2 election](https://ballotpedia.org/Alabama%27s_2nd_Congressional_District_election,_2018), and so on).

Note the answers to the following questions (not to turn in, just to orient yourself):

- What is different about the first line of the file?
- How many fields (i.e., columns) are there? This is a Comma Separated Values (CSV) file, so commas appear between the fields on each line.
- What fields would you need to determine the winning party for a particular election?

The file object returned by the `open` function has a couple of useful methods for reading in our data.

- `data_file.readline()` will return the next line in the file as a string.
- `data_file.readlines()` will return a list of all the remaining lines in the file, where each element of the list is one line as a string.

As we saw in class, you can also use a `for` loop to iterate over the lines in a file. Your goal is to have a variable `data_lines` that stores a list of all the lines in the file, except the very first line. This first line contains the names of the fields in the data file, and including it in `data_lines` will cause problems when it comes time to do the analysis. You can handle excluding this line as you read in the file or after the fact using the list method [`pop`](https://docs.python.org/3.7/tutorial/datastructures.html?highlight=pop).
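As a minimal sketch of the file methods just described, the snippet below writes a tiny made-up CSV, reads it back with `readlines`, and drops the header with `pop`. The file name `toy.csv`, its contents, and the variable names are illustrative assumptions, not part of the lab data, and this is only one possible approach.

```python
import os
import tempfile

# Build a tiny made-up CSV so the example is self-contained.
# (toy.csv and its contents are hypothetical, not the lab's data file.)
toy_path = os.path.join(tempfile.gettempdir(), "toy.csv")
toy_file = open(toy_path, "w")
toy_file.write("name,score\n")
toy_file.write("Ada,10\n")
toy_file.write("Grace,12\n")
toy_file.close()

# Read every line of the file into a list of strings.
toy_file = open(toy_path)
toy_lines = toy_file.readlines()
toy_file.close()

# Remove the header line at index 0; pop returns the removed element.
header = toy_lines.pop(0)

print(header)     # the header string, which still ends in "\n"
print(toy_lines)  # only the data lines remain

# Clean up the temporary file.
os.remove(toy_path)
```

Note that each string returned by `readlines` keeps its trailing newline character.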
Remember to close the data file with `data_file.close()` after you've finished reading it.

## Processing the Data

At this point, you have the raw data, but it's not in a very convenient form. If you add `print(data_lines[0])` to `gerrymandering.py`, you'll see that our data is still in text form rather than divided into columns. Fortunately, Python strings have a [`split` method](https://docs.python.org/3.7/library/stdtypes.html#str.split) that will do exactly what you need. Follow the link to the documentation on `split` and read through the description and examples to see how you can use it to convert a line of input data from raw text into a list where each element is an entry in the data.

Write a `for` loop that replaces each element of `data_lines` with the corresponding split-up version. You'll know you've succeeded when `print(data_lines[-3])` displays

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Python
['2018', 'Wyoming', 'WY', '56', '83', '68', 'U.S. Representative', 'District 0', 'gen', 'FALSE', 'Liz Cheney', 'republican', 'FALSE', 'total', '127963', '201245', 'FALSE', '20190131\n']
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# Accessor Functions

`data_lines` is now a list of lists where each inner list contains the data from one line of the input file. If we want to access a specific field within a line, we can do so by accessing the inner list element at the appropriate index. For example, the `office` field (indicating which elected office an election was for) is the seventh "column" in our data file (counting by commas), so `data_lines[0][6]` accesses the `office` field of the first line in the file. Writing `[6]` everywhere to access the `office` field is bad style, however, since it means we have to remember that `[6]` means "get the office."
A much better approach would be to define a function for accessing the `office` field:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Python
def get_office(line):
    # line is one split-up row; the office is the seventh field
    return line[6]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This would let us write the far more understandable `get_office(data_lines[0])`. You will be expected to write a similar function for every field you need to access in your analysis. If that field is numerical (e.g., a number of votes), you should use the `int` function to convert the text to a number as part of the function.

As far as `get_office` goes, you won't need that particular function. Since we're only dealing with data about US House of Representatives elections, the `office` field is not very interesting for our analysis (it's the same for every line).

# Useful Lists

Before you start coding up the analysis itself, there are two lists you should generate that will make your life much easier: a list of all 50 states, and a list of all 435 elections. As mentioned in the Logistics section, you should generate these lists from the data instead of typing them out directly in your code. To generate a list of states, you might try

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Python
states = []
for line in data_lines:
    # appends the state from every line, duplicates included
    states.append(get_state(line))
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This has a problem: every state appears multiple times in the data file, so we end up with a ton of duplicate entries in our list. It's much more useful to have a list where each state appears only once. To accomplish this, insert an `if` statement in your loop over the data that checks whether a state has already been added to the list before appending it. Remember that you can use `VALUE in LIST` to get `True` if `VALUE` is present in `LIST` and `False` otherwise.

You should also generate a list of elections in a similar fashion. As described above, an election is identified by the combination of the state and district.
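The membership check described above for building a list without duplicates can be sketched on made-up data. The list `colors_seen` and the other names here are hypothetical stand-ins chosen for illustration; in the lab, the values must come from the CSV file instead.

```python
# Made-up data with repeats (hypothetical; not from the lab's CSV file).
colors_seen = ["red", "blue", "red", "green", "blue", "red"]

unique_colors = []
for color in colors_seen:
    # Append a value only the first time we encounter it.
    if color not in unique_colors:
        unique_colors.append(color)

print(unique_colors)
print(len(unique_colors))
```

Because values are appended in the order they are first seen, the resulting list preserves the order of first appearance.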
Write a `get_election` function that returns the state and district fields concatenated together, and then use a loop to generate a list `elections`. You can check your work by printing out the lengths of the `states` and `elections` lists, which should be 50 and 435, respectively.

# Analysis Algorithm

Now everything is in place for your analysis of potential gerrymandering. Specifically, you need to compute four results for **each state**:

- The percentage of the two-party vote received by Republicans
- The percentage of the seats won by Republicans
- The percentage of the two-party vote received by Democrats
- The percentage of the seats won by Democrats

Here the *two-party vote* counts only the votes for Republicans and Democrats, not including any other parties or write-in votes. Below is pseudocode (something partway between English and a programming language) for the algorithm you should use to compute these quantities.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ none
for each state
    initialize variables for GOP and Dem votes and seats
    for each election in the current state
        generate a list of vote-party tuples from the current election
        for each vote-party tuple
            add votes to the corresponding party's vote total
        determine the winning party
        add one to the winning party's seat total
    output results
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Some advice on translating this pseudocode to Python:

- To loop over elections in the current state, use a `for` loop over `elections` with an `if` statement inside to check that the election matches the current state. The string method [`startswith`](https://docs.python.org/3/library/stdtypes.html#str.startswith) is a good way to check this.
- To generate a list of vote-party tuples from the current election, initialize an empty list and then use a `for` loop over `data_lines` with an `if` statement checking if the election matches. Inside the `if`, append a tuple of the votes and the party from the current line.
  It's important that votes be the first element of the tuple.
- Remember that you can access the elements of a tuple with indexing just like elements of a list.
- The easiest way to determine the winning party is to use the list method [`sort`](https://docs.python.org/3/tutorial/datastructures.html#more-on-lists) on your list of tuples. It will sort the list in ascending order (smallest to largest) according to the first element of each tuple. This will conveniently place the tuple with the largest number of votes at the end of the list, which we can access at index -1.
- You can use `if`-`elif` to determine which party's variables to add votes or seats to (the two parties are `"republican"` and `"democrat"`).

See the next section for details on output.

# Output File

Instead of sending the output of your analysis to the screen with `print`, you will write it to a file. The output for each state should be in the following format:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ none
STATE NAME
Republican: REPUB_VOTE_PERCENTAGE% of the vote, REPUB_SEAT_PERCENTAGE% of the seats
Democrat: DEM_VOTE_PERCENTAGE% of the vote, DEM_SEAT_PERCENTAGE% of the seats
BLANK LINE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All percentages should be rounded to two decimal places (use the built-in `round` function). For example, the output for the first two states would be:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ none
Alabama
Republican: 58.98% of the vote, 85.71% of the seats
Democrat: 41.02% of the vote, 14.29% of the seats

Alaska
Republican: 53.31% of the vote, 100.0% of the seats
Democrat: 46.69% of the vote, 0.0% of the seats

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

At the top of the file where you open the data file, add the following line to open the output file in *write mode*.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Python
output_file = open("gerrymandering-report.txt", "w")
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This will create a new, blank file called `gerrymandering-report.txt`, overwriting any file with that name that already exists. Remember to close the output file at the end of your program.

To write your output to this file, you'll use its `write` method (i.e., `output_file.write("...")`). `write` differs from `print` in two important ways:

- `write` does not insert a new line with each call the way `print` does, so you will need to handle new lines yourself. In Python the new line character is written `"\n"`.
- while `print` can take multiple arguments and combines them in the output with spaces in between, `write` will only take a single argument.

You can use `+` to concatenate multiple strings together, so you might use

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Python
output_file.write(state + "\n")
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

to write a line to the output file with the name of a state. This concatenation will only work between strings, so you'll have to use the `str` function to convert any numbers into strings.

Here's an example of `print` and `write` calls that produce the same output (one to the screen and the other to a file). Note how you need to put in spaces between words yourself when using `write`:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Python
outfile = open("rep_count.txt", "w")
state = "Minnesota"
count = 8
print(state, "has", count, "representatives")
outfile.write(state + " has " + str(count) + " representatives\n")
outfile.close()  # close the file once you're done writing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# What's Up With Gerry?

Data analysis is just number crunching without some kind of human interpretation or action based on the results. Now that you've written code to crunch the numbers, it's time to interpret! The redistricting after the 2010 census has been challenged in court in a number of different states.
The analysis you perform in this lab is far from the only way to measure gerrymandering, but it can still help illuminate what's going on. If you haven't already, please review this [YouTube video](https://www.youtube.com/watch?v=Mky11UJb9AY) and [FiveThirtyEight article](https://projects.fivethirtyeight.com/partisan-gerrymandering-north-carolina/) on gerrymandering.

Your task is to interpret the results of your analysis for three states: Maryland, Michigan, and Wisconsin. The question you are to consider is whether your analysis supports claims of gerrymandering in those states. In comments at the top of `gerrymandering.py`, write two paragraphs answering this question. Your answer should include both numerical results from your analysis and comparisons between those three states and others. If your answer is that your results don't clearly support any conclusion about the presence of gerrymandering, that's perfectly fine---just make sure to support your argument. I'm not looking for an in-depth essay here, just a demonstration that you've taken the important step in data analysis of engaging with the issue you are analyzing.

# Testing

You will be expected to write code that produces an output file in **exactly** the format described above. To help you check this as well as check your numerical results, you can use this sample output file: [sample-gerrymandering-report.txt](sample-gerrymandering-report.txt). VS Code provides a [very useful way of comparing two files](https://www.meziantou.net/comparing-files-using-visual-studio-code.htm#comparing-files-usin). Use the method described in that link to bring up your output file and the sample output file side-by-side, and make sure there are no differences (differences will show up highlighted green or red in VS Code).

# OPTIONAL CHALLENGES

Play around with any or all of these challenges if you find yourself with extra time.
Put all code for these in `extensions.py` (including copying any or all of your code from `gerrymandering.py`).

(##) Not Your Average Democratic Party

The analysis you've done so far has a critical flaw---it gets the wrong results for Minnesota! It is not the case that Minnesota Republicans won 100% of the vote in 2018. Track down what's causing this incorrect result, and implement a fixed version of your gerrymandering analysis in `extensions.py`.

(##) Margin of Victory

In most US elections[^othervoting], the candidate with the most votes on Election Day wins. How much or how little a candidate wins by (i.e., the difference between the winning vote total and the second place vote total), however, can influence how the result is interpreted and how the winner behaves in office. For example, a Representative who won by the slimmest of margins might be more politically cautious than one who won in a landslide[^landslide].

Add code to `extensions.py` to find the closest and most lopsided **contested** House races of 2018 (**measured in number of votes between first and second place**). For each one, it should write a line to `extensions.txt` describing the seat (i.e., state and district as in `Washington District 3`) and this margin of victory.

(##) D) None of the Above

In addition to voting for candidates on the ballot, voters can instead cast votes for candidates (or anyone, really) that *don't* appear on the ballot. These are called *write-in* votes, as they require the voter to literally write in the name of the person they want to vote for.
Very rarely a write-in candidate can even win, as was the case for [Alaska Senator Lisa Murkowski's reelection in 2010](https://en.wikipedia.org/wiki/2010_United_States_Senate_election_in_Alaska), but more often a write-in vote is intended as satire or protest (see the [120 write-in votes for Donald Duck in the 2010 general elections in Sweden](https://www.telegraph.co.uk/news/worldnews/europe/sweden/8021258/Donald-Duck-and-God-mar-Swedish-election.html)).

Add code to `extensions.py` to find the 2018 House election with the most write-in votes. Your code should write a line to `extensions.txt` with both the election and the number of write-in votes.

[^othervoting]: exceptions include two-round runoff elections as in [Louisiana](https://en.wikipedia.org/wiki/Elections_in_Louisiana) and the [ranked-choice voting recently instituted in Maine](https://en.wikipedia.org/wiki/2016_Maine_Question_5)

[^landslide]: *won in a landslide* is a phrase meaning a candidate had a huge margin of victory

# What to Turn In

Submit the following files via the [Lab 3 Moodle page](https://moodle.carleton.edu/mod/assign/view.php?id=507495). You **do not** need to submit any CSV files.

- A file called `gerrymandering.py` that, when run, uses the provided data file to create (i.e., open in write mode) a file called `gerrymandering-report.txt` formatted as described in [Output File](#outputfile). Please do **not submit** your output file. We will run your code to generate the output file as part of the grading.
- OPTIONALLY a file called `extensions.py` containing your code for any OPTIONAL CHALLENGES.
- A file called `feedback.txt` with the following information (you get points for answering these questions!):
    - How many hours you spent outside of class on this homework.
    - The difficulty of this homework for this point in a 100-level course as: too easy, easy, moderate, challenging, or too hard.
    - What did you learn on this homework (very briefly)?
      Rate the educational value relative to the time invested from 1 (low) to 5 (high).

# Grading

This assignment is graded out of 30 points as follows:

- Submitting `feedback.txt` -- 3 points
- Running `gerrymandering.py` produces a file called `gerrymandering-report.txt` -- 2 points
- Only uses data from the input file -- 3 points
- Only uses accessor functions to access fields -- 1 point
- Generates unique lists of states and elections -- 1 point
- Output file matches specified format -- 4 points
- Output file contains correct numerical results -- 6 points
- Interpretation of results -- 5 points
    + Includes numerical results as evidence
    + Includes comparisons between states to put results in context
    + Reflects a basic understanding of the gerrymandering issue
- Style -- 5 points
    + Good coding style including putting spaces around `=` and operators (`>`, `or`, etc.) -- 1 point
    + Descriptive variable names -- 1 point
    + Brief comments describing each section of code -- 1 point
    + Comments with name, date, and purpose at the top of each `.py` file you submit -- 1 point
    + All opened files are closed -- 1 point

Your output file will be evaluated by an automated script that tests for correct format and results. If you submit code that crashes the autograder (because your code crashes or you used different function names than those specified in this writeup), we will attempt to fix any issues and re-grade. You will lose points for each issue we need to fix. All other aspects will be evaluated by us reading your code.