**CS 111 f19 - Homework 3 (Pair Programming)**
*Due: Monday, October 14, at 9pm*

For this homework, you'll be writing a Python program to analyze a data set. Python is frequently the programming language of choice for performing data analysis in the sciences and other fields, so the skills you practice in this homework are some of the most broadly applicable in CS 111. You'll work with data collected by the [MIT Election Data and Science Lab (MEDSL)](https://electionlab.mit.edu/) about state and federal elections in the US in 2018.

Download the following files into the same directory to start the homework:

- [`district_overall_2018.csv`](district_overall_2018.csv): results for elections to the [US House of Representatives](https://en.wikipedia.org/wiki/United_States_House_of_Representatives), which along with the US Senate forms the US Congress.
- [`state_overall_2018.csv`](state_overall_2018.csv): results for state-level elections including statewide (e.g., [Governor](https://en.wikipedia.org/wiki/Governor_of_Minnesota) or [Attorney General](https://en.wikipedia.org/wiki/Attorney_General_of_Minnesota)) and state legislature (e.g., [Minnesota House of Representatives](https://en.wikipedia.org/wiki/Minnesota_House_of_Representatives)).

The sections below describe a series of questions about this data that you will write code to answer. Your code will read in the two data files above, and write the answers to the questions to a file called `report.txt`. **Your code should not read in any other data, nor declare data besides that read in from the CSV files** (e.g., you should not declare a list of states like `states = ["Alabama", "Alaska", ...]`--any data like this must be computed from the input files).

Start early and post questions in the [Homework 3 forum](https://moodle.carleton.edu/mod/forum/view.php?id=487648)!

---

# Getting Started

Let's start by using Python and the available data to answer the question: _who ran to represent Northfield in Congress in 2018?_ Looking at this map of Minnesota's 2nd Congressional District, we can see it includes Northfield, so that's the district we need to investigate.

![Minnesota 2nd Congressional District (source: [Wikipedia](https://en.wikipedia.org/wiki/Minnesota%27s_2nd_congressional_district))](Minnesota_US_Congressional_District_2.png)

The first step in any data analysis is to load in your data. In this case, we need to open and read in `district_overall_2018.csv`, which has information about elections to the House. Start a new Python script called `elections.py`. In it, use the built-in function `open` to open the data file like this:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Python
data_file = open("district_overall_2018.csv")
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For this to work, `district_overall_2018.csv` must be in the same directory as `elections.py`.

Now that we've opened the file, we can start reading it in. First, though, we need to actually look at the file to see how the data is arranged. Open `district_overall_2018.csv` (either in a text editor like Atom or in Excel) and take a look around. You may find it useful to refer to the [documentation from MEDSL](https://github.com/MEDSL/2018-elections-official/blob/master/codebook-2018-election-data.md) about the variables (i.e., columns) in the data.

Note the answers to the following questions (not to turn in, just to orient yourself):

- What is different about the first line of the file?
- How many fields (i.e., columns) are there? As a Comma Separated Value (CSV) file, commas appear between the fields on each line.
- Why do some lines have `NA` for the candidate and party?
- Which fields would you need to get the answer to our question (who were the candidates in Minnesota's 2nd district)?

You probably noticed that the first line contains the names of each field, and that the rest are the actual data. We'll read in the first line separately with `data_file.readline()` and then read in the rest all at once with `data_file.readlines()`. `elections.py` should now look like this:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Python
data_file = open("district_overall_2018.csv")
first_line = data_file.readline()  # returns a string of the next line in the file (in this case the first)
data_lines = data_file.readlines() # returns a list of all remaining lines in the file
data_file.close()                  # you should always close a file when you're done with it
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

At this point, we have the raw data, but it's not in a very convenient form. If you add `print(data_lines[0])` to `elections.py`, you'll see that our data is still in text form rather than divided into columns. Fortunately, Python strings have a [`split` method](https://docs.python.org/3.7/library/stdtypes.html#str.split) that will do exactly what we need. Follow the link to the documentation on `split` and take a moment to see if you can figure out how we can use it to make our data easily accessible.

In the documentation, there's an example of using `split` to divide up values separated by commas, which is what we need to do. Add `print(data_lines[0].split(","))` to `elections.py`, and note how what's printed differs from the output of `print(data_lines[0])` (the argument to `split` determines what character it uses to divide up the string, in our case a comma). `split` turns a line of our raw text into a list where each element is an entry in the data!

With the data in list form, we can use list indexes to access specific parts of the data. To print out the candidate from that first line we could write `print(data_lines[0].split(",")[10])`. Where did `10` come from?
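One way to investigate a question like this is to print each field name next to its index. The sketch below does so on a made-up four-field header (`toy_header` is a hypothetical stand-in, not the real first line of the data file, which has more fields). It also calls `strip()` before splitting, since a line read from a file keeps its trailing `"\n"`, which would otherwise stay attached to the last field.

```python
# Made-up header line for illustration only; the real file has more fields.
toy_header = "year,state,office,candidate\n"

# strip() removes the trailing newline so it does not end up in the last field
toy_fields = toy_header.strip().split(",")

# print each field name next to its list index
for index in range(len(toy_fields)):
    print(index, toy_fields[index])
```

In this toy header, `candidate` is the 4th entry and so sits at index 3. The same counting applies to the real header line.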
If you print out `first_line.split(",")`, you'll see that the `candidate` field is the 11th entry, meaning it is at index 10 of the list.

We don't want to be calling `split` every single time we access some part of the data, so we'll use a loop to replace each of our lines with the list version `split` produces:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Python
for i in range(len(data_lines)):
    data_lines[i] = data_lines[i].split(",")
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Now to find the answer to our question. One approach would be to loop through every line in the data and extract just the information we're looking for. Since there were multiple candidates competing in the 2nd District in 2018, we'll need a list to keep track of them.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Python
candidates = []
for line in data_lines:
    if # FILL IN THE APPROPRIATE CONDITION:
        candidates.append(line[10]) # append adds an element to the end of a list
print(candidates)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The condition you fill in will need to use `and` to check for a combination of things. Looking through `district_overall_2018.csv` will help you identify the relevant fields. For example, we're interested in an election in Minnesota, so we'd want `line[1] == "Minnesota"`. That alone would not be sufficient because it would be true for *all* Minnesota elections, and we're interested in a *specific* district.

Once you have the correct condition, the candidates printed out should match the ones listed on [Ballotpedia](https://ballotpedia.org/Minnesota%27s_2nd_Congressional_District_election,_2018).

Finally, this homework requires that `elections.py` write the answers to these questions out to a file called `report.txt` rather than printing them. As we saw in class, we open a file that we can write to with `out_file = open("report.txt", "w")`, where the second argument `"w"` tells Python to open the file in *writing mode*.
Remember that we would then use `out_file.write("text going to the file goes here\n")` to write each output line. We have to put the newline character `"\n"` at the end because, unlike `print`, `write` does not automatically include it. Putting everything together, `elections.py` should now look like:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Python
data_file = open("district_overall_2018.csv")
first_line = data_file.readline()  # returns a string of the next line in the file (in this case the first)
data_lines = data_file.readlines() # returns a list of all remaining lines in the file
data_file.close()                  # you should always close a file when you're done with it

for i in range(len(data_lines)):
    data_lines[i] = data_lines[i].split(",")

out_file = open("report.txt", "w")

candidates = []
for line in data_lines:
    if # FILL IN THE APPROPRIATE CONDITION:
        candidates.append(line[10]) # append adds an element to the end of a list

out_file.write("The following candidates ran to represent Northfield in the House in 2018:\n")
for c in candidates:
    out_file.write(c + "\n")
out_file.write("\n") # add a blank line to separate this from the answer to the next question

# CODE FOR THE REST OF THE HOMEWORK GOES HERE

out_file.close() # close the output file at the very end of the script
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You're now ready to tackle the questions in the rest of the homework. You should continue extending `elections.py` to write the answers to `report.txt`. Remember to refer to the [Advice Section](hw3.md.html#advice) for tips and guidance.

# Computing Congress

Use `district_overall_2018.csv` to answer the following questions.

## Representation

Each US state has a number of representatives in the US House of Representatives proportional to that state's population (in contrast to the US Senate, where each state gets two senators). Add code to `elections.py` to use the election data to compute the number of representatives from each state.
Your code should write the results (one line per state) to `report.txt`. You can check your work by comparing your results to [this Encyclopedia Britannica entry](https://www.britannica.com/topic/United-States-House-of-Representatives-Seats-by-State-1787120). If there's something off with the Florida number, that's not necessarily a sign of an error. `district_overall_2018.csv` omits four Florida House races where there was only one candidate on the ballot.

## Margin of Victory

In most US elections[^othervoting], the candidate with the most votes on Election Day wins. How much or how little a candidate wins by (i.e., the difference between the winning vote total and the second place vote total), however, can influence how the result is interpreted and how the winner behaves in office. For example, a Representative who won by the slimmest of margins might be more politically cautious than one who won in a landslide[^landslide].

Add code to `elections.py` to find the closest and most lopsided **contested** House races of 2018 (**measured in number of votes between first and second place**). For each one, it should write a line to `report.txt` describing the seat (i.e., state and district as in `Washington District 3`) and this margin of victory.

Just for fun: what about the district or the candidates do you think contributed to the most lopsided race?

[^othervoting]: exceptions include two-round runoff elections as in [Louisiana](https://en.wikipedia.org/wiki/Elections_in_Louisiana) and the [ranked-choice voting recently instituted in Maine](https://en.wikipedia.org/wiki/2016_Maine_Question_5)

[^landslide]: *won in a landslide* is a phrase meaning a candidate had a huge margin of victory

## Competition

One feature of American democracy (and representative democracies in general) is the notion of *safe* districts and *competitive* districts.
For example, Minnesota's 5th Congressional District is a safe district for [Democratic–Farmer–Labor (DFL)](https://www.dfl.org/)[^dfl] candidates, with those candidates typically getting well over 60% of the vote. This district has been represented by members of the DFL party since the 1960s. In contrast, Northfield's district, District 2, has recently become competitive, with the 2016 and 2018 elections seeing the second-place candidate within 5% of the winner.

Add code to `elections.py` to find all the competitive House elections in 2018, where an election is considered competitive if the first and second place candidates are separated by less than 10% of the total vote. Your code should write a line to `report.txt` stating the number of competitive elections, followed by a line describing each competitive seat (i.e., state and district as in `Washington District 3`).

[^dfl]: Minnesota's own version of the Democratic Party formed in 1944 by a merger of the Minnesota Democratic Party and the Minnesota Farmer–Labor Party

# Making a Statement

Use `state_overall_2018.csv` to answer the following questions about Minnesota elections. You will need to read in and process this file in the same way you did for `district_overall_2018.csv`.

## From Grand Portage to Luverne

In local elections (i.e., those below the federal level), officials such as Governor are elected in *statewide* elections, meaning they are elected by all voters in a state. Other officials, such as State Senators, are elected by voters in a specific district. Take a look through `state_overall_2018.csv`--how does it distinguish between candidates in statewide elections and those running in a specific district?

Add code to `elections.py` to count the number of statewide candidates running in Minnesota in 2018. Your code should write a line to `report.txt` with this information.
In cases where two candidates ran as a combined ticket (i.e., `Governor & Lt Governor`), count them as two statewide candidates.

## D) None of the Above

In addition to voting for candidates on the ballot, voters can cast votes for candidates (or anyone, really) that *don't* appear on the ballot. These are called *write-in* votes, as they require the voter to literally write in the name of the person they want to vote for. Very rarely a write-in candidate can even win, as was the case for [Alaska Senator Lisa Murkowski's reelection in 2010](https://en.wikipedia.org/wiki/2010_United_States_Senate_election_in_Alaska), but more often a write-in vote is intended as satire or protest (see the [120 write-in votes for Donald Duck in the 2010 general elections in Sweden](https://www.telegraph.co.uk/news/worldnews/europe/sweden/8021258/Donald-Duck-and-God-mar-Swedish-election.html)).

Add code to `elections.py` to find the Minnesota election with the most write-in votes. Your code should write a line to `report.txt` with both the office and the number of write-in votes.

## Why Bother?

If you check your previous answer in `state_overall_2018.csv`, you'll see that the Minnesota election with the most write-in votes consisted of a single candidate running unopposed. It's not surprising that more voters chose to write in someone or something else when there was no choice to be had. This raises an interesting question--to what extent do voters just not bother voting in uncontested races at all even as they fill out the rest of their ballot?

A simple way of assessing this is to compare the total number of votes cast (called the *turnout*) in contested (more than one candidate) and uncontested (only one candidate) statewide races. Add code to `elections.py` to compute the **average** total votes for contested statewide elections in Minnesota and the **average** total votes for uncontested statewide elections in Minnesota.
Your code should write two lines to `report.txt` with these averages.

# What's Up With Gerry? (OPTIONAL)

Use `state_overall_2018.csv` to answer this question about Wisconsin elections.

You may wonder, where do the districts in which candidates compete come from? This can vary considerably, but in many states it's the state legislature that determines the boundaries of districts for both federal and state elections. This opens up the possibility that the party in control of the state legislature could draw these boundaries to its own advantage. This practice, called [*gerrymandering*](https://en.wikipedia.org/wiki/Gerrymandering), has been a part of US politics since before the United States was even an independent nation.

In one of several recent high-profile legal challenges to gerrymandered district boundaries, in 2015 a group of voters in Wisconsin [sued to overturn a 2011 redistricting plan](https://www.brennancenter.org/our-work/court-cases/gill-v-whitford) for the Wisconsin State Assembly. While the suit was eventually unsuccessful at the US Supreme Court in 2019, the Wisconsin State Assembly nevertheless presents an interesting case study of what an aggressive partisan gerrymander can achieve.

A very simple way of analyzing the "fairness" of legislative districts is to compare the share of the overall vote each party got with the share of the total seats they won. That is, if the Python Party got 45% of the overall vote while the Java Party got 30%, then we would expect the Python Party to control about 45% of the seats in the legislature and likewise for the Java Party to control about 30% of the seats. Even in entirely fair districts, vote share and seat share won't necessarily match perfectly, but if they are vastly different, it may indicate gerrymandering at work.
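The comparison described above can be sketched on made-up numbers for the hypothetical Python and Java Parties. Every total below is invented purely for illustration (in the homework itself, all data must come from the CSV files), and `percent` is a hypothetical helper name, not something provided by the course.

```python
# Made-up totals for the hypothetical Python and Java Parties.
# These numbers are invented for illustration and are not election data.
total_votes = 100000
python_votes = 45000
java_votes = 30000

total_seats = 100
python_seats = 70
java_seats = 20


def percent(part, whole):
    # hypothetical helper: what percent of whole is part?
    return 100 * part / whole


python_vote_share = percent(python_votes, total_votes)
python_seat_share = percent(python_seats, total_seats)
java_vote_share = percent(java_votes, total_votes)
java_seat_share = percent(java_seats, total_seats)

print("Python Party:", python_vote_share, "% of votes,", python_seat_share, "% of seats")
print("Java Party:", java_vote_share, "% of votes,", java_seat_share, "% of seats")
```

With these invented totals, the Python Party holds a much larger share of seats than of votes, which is the kind of gap that might prompt a closer look at how the districts were drawn.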

Add code to `elections.py` to compute the following four quantities for Wisconsin State Assembly elections:

- Percent of the overall two-party vote received by Democratic candidates
- Percent of the State Assembly seats won by Democratic candidates
- Percent of the overall two-party vote received by Republican candidates
- Percent of the State Assembly seats won by Republican candidates

The two-party vote is the total votes cast for either Democrats or Republicans. Your code should write a line to `report.txt` for each of these computations.

# Advice

1. Note how in [Getting Started](hw3.md.html#gettingstarted), we got the candidate from a particular line with `line[10]`. This is not ideal because we would need to remember that index 10 means candidate throughout our code. You might consider writing a function like

   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Python
   def get_candidate(line):
       return line[10]
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

   and then using `get_candidate(line)` instead of `line[10]`. It might make your code much easier to read and write if you create a function like this for every field you need to access. If the data is numerical (like vote totals) or boolean (like whether or not this line represents write-in votes), these functions could do the proper conversions using the `int` function or comparing to the relevant string.

2. In the [Representation](hw3.md.html#representation) question, you need to compute something for each state. For this, it might prove useful to have a list of states to loop over. You could do

   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Python
   states = []
   for line in data_lines:
       states.append(line[1])
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

   but then you'd have many duplicate entries.
   Instead, you can use Python's capability to check whether an element is present in a list to make sure you append each state to the list only once:

   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Python
   states = []
   for line in data_lines:
       if line[1] not in states:
           states.append(line[1])
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

   Note that `x in nums` is `True` if `nums` contains `x`, whereas `x not in nums` is `True` if `nums` does not contain `x`.

   In using your list of states, you may want to loop over another list for each state. You can do this by putting one `for` loop inside another (called *nested loops*). This example would print out the number of lines in the data file that correspond to each state (not the same as the number of representatives, but similar):

   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Python
   for state in states:
       count = 0
       for line in data_lines:
           if line[1] == state:
               count = count + 1
       print(state, count)
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

   Nested loops will frequently be necessary on this homework.

3. Some questions ask that you write a description of a seat (i.e., state and district) to `report.txt`. You can get a string with the state and district separated by a space with `line[1] + " " + line[7]`.

4. The [Margin of Victory](hw3.md.html#marginofvictory) question asks you to find the closest House election in 2018. One strategy for finding the minimum element in a list is using a variable to keep track of the *minimum element you've found so far*. Going back to the example of counting the number of lines in the data for each state, let's say we wanted to figure out which state had the fewest lines.

   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Python
   min_state = ""
   min_lines = 10000
   for state in states:
       count = 0
       for line in data_lines: # this loop counts the number of lines for the current state
           if line[1] == state:
               count = count + 1
       if count < min_lines: # we've found a state with fewer lines
           min_lines = count
           min_state = state
   # once we finish looping over every state, min_lines and min_state will
   # match the state with the fewest lines because we've checked them all
   print("The state with the fewest lines is", min_state, "with", min_lines, "lines")
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

   `min_lines` keeps track of the minimum number of lines for a state we've seen so far. Every time we find a state with fewer lines, we update its value along with `min_state`. We start `min_lines` at a very high value to make sure it and `min_state` get updated immediately. If we started `min_lines` at 0, it would be smaller than the number of lines for any state, and so would never be updated.

5. For questions that involve computing the difference in votes between the top two candidates or determining the winner of an election, it may be useful to put a list in sorted order. The `sorted` function takes a list as a parameter and returns a new list in ascending order. See the code below for examples or refer to the [documentation](https://docs.python.org/3/library/functions.html#sorted).

   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Python
   nums = [3, 5, 2, 12, 1]
   nums_sorted = sorted(nums) # does not affect the original nums list at all
   print(nums_sorted) # prints [1, 2, 3, 5, 12]
   nums_reversed = sorted(nums, reverse=True)
   print(nums_reversed) # prints [12, 5, 3, 2, 1]
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

6. This homework asks you to write output to a file instead of using `print` to display it on the terminal. There are a couple of important differences between `print` and the `write` method.
   First, as we discussed in class, `write` doesn't automatically put a new line at the end, so you'll need to include `\n` at the end of every line. Second, `write` takes only a single string, whereas we can pass `print` any number of arguments and it will combine them with spaces. See the code below for an example of equivalent `print` and `write` calls (except that one sends the output to the screen and the other to a file).

   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Python
   outfile = open("report.txt", "w")
   state = "Minnesota"
   count = 8
   print(state, "has", count, "representatives")
   outfile.write(state + " has " + str(count) + " representatives\n")
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

   Since we are using `+` to concatenate different parts of the output into a single string, we need every term to be a string. `count`, however, is an `int` and would cause an error if we tried to add it to a string. Hence, we need to use the `str` function to turn it into a string. Note that you also have to take care of putting in spaces when using `write`.

# What to Turn In

Submit the following files via the [Moodle Homework 3 page](https://moodle.carleton.edu/mod/assign/view.php?id=487970). You **do not** need to submit either of the CSV files.

- A file called `elections.py` that when run uses the provided data files to create (i.e., open in writing mode) a file called `report.txt` containing the answers to the following questions (see the sections above for more detail):
    + Who were the candidates in the Minnesota Congressional District 2 election?
    + How many representatives does each state get in the US House?
    + What was the closest House race?
    + What was the most lopsided House race?
    + How many competitive House races were there, and what were they?
    + How many candidates ran statewide in Minnesota?
    + Which Minnesota election has the most write-ins?
    + What was the average turnout for both contested and uncontested statewide Minnesota elections?
    + (OPTIONAL) What percentage of the overall two-party vote did Democratic candidates for the Wisconsin State Assembly receive?
    + (OPTIONAL) What percentage of the Wisconsin State Assembly seats did Democratic candidates win?
    + (OPTIONAL) What percentage of the overall two-party vote did Republican candidates for the Wisconsin State Assembly receive?
    + (OPTIONAL) What percentage of the Wisconsin State Assembly seats did Republican candidates win?
- A file called `feedback.txt` with the following information (you get points for answering these questions!):
    - How many hours you spent outside of class on this homework.
    - The difficulty of this homework for this point in a 100-level course as: too easy, easy, moderate, challenging, or too hard.
    - What did you learn on this homework (very briefly)? Rate the educational value relative to the time invested from 1 (low) to 5 (high).