Build Your Own FiveThirtyEight Clone, Part 1

Alex
5 min read · May 1, 2020

It has been a minute since my last post, where I used R to look for evidence of racial disparities in police shootings. I got that data from the Washington Post, which leads me in a roundabout way to today’s post. I wanted to do some more exploration with Washington D.C. data; while there is a wealth of it — and much of that is actually cool and interesting stuff — I’d also started experimenting with Bokeh, a visualization library in Python that has native JavaScript integration and thus makes for nice, web-ready charts. That led me to look for good candidates for data to plot, so naturally I thought about presidential candidates (hopefully the Washington D.C. connection is clearer now). (Also, shoutout to Build Your Own Clone, a company that makes good, if finicky, DIY guitar effect pedal kits. Check them out here.)

I’ve long been impressed by FiveThirtyEight’s data presentation (judging from the fact that R’s ggplot2 library has a built-in ‘fivethirtyeight’ theme, I can’t be the only one), so the chance to try to build my own version with Bokeh’s interactive plots seemed like a good next project to take on. Bokeh also strikes me as easier to use than matplotlib, which is typically the go-to visualization library for Python: ggplot2’s syntax has always seemed more intuitive to me, and Bokeh’s is similar. In this project we can combine some math, good Python, and visuals into something topical. (This seems like a good place to declare that I am not a pollster, and there are certainly nuances involved in the proper analysis of polling data that I am a) unaware of and b) glossing over.)

The first step here is to choose and get the data we’ll look at. I’m going to focus on recent general election polls. For the purposes of this blog, I’m going to be pretty liberal with the definition of “recent,” and call everything since the beginning of 2020 “recent.” (Hey, this isn’t a blog about the actual politics involved here, I’m just trying to demonstrate some trends and techniques). I’m going to get the data from Real Clear Politics, using a Python library called, naturally, realclearpolitics. This is basically a parser for http requests that we make to RCP, but it’s a nice, simple package that will save some work on the front end. For the sake of ensuring that we’re comparing apples to apples here, I’ll only deal with national, general election polls (say what you will about their utility at this date).

First, we get the data. We can provide the rcp module with a candidate or a URL to locate polling data. I wanted to stay in the Jupyter environment, so I went the candidate route, which required a little extra parsing to get at the results that I wanted, because the module will return all polls featuring that candidate (so, state polls as well as general election polls).
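As a rough sketch, that candidate-route filtering could look something like the snippet below. To be clear, the candidate argument and the ‘title’ key are assumptions about the library’s interface and output rather than confirmed API; inspect the raw result yourself before relying on a filter like this.

# Hypothetical sketch only: assumes get_poll_data accepts a candidate name and
# that each returned poll record carries a 'title' field we can match against.
all_polls = rcp.get_poll_data('Joe Biden')
general = [poll for poll in all_polls
           if 'General Election: Trump vs. Biden' in str(poll.get('title', ''))]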

Now we have the polling data we wanted. We can make a DataFrame with the following code:

import pandas as pd
import numpy as np

res = rcp.get_poll_data(url)
df = pd.DataFrame(res[0]['data'])
df  ## prints the following

# Note that all columns are strings; we need to convert the polling columns to numeric.
# Also note that in the first row of MoE we have a '--' string. Convert those to np.nan.
df.loc[df['MoE'] == '--', 'MoE'] = np.nan
df[['MoE', 'Biden (D)', 'Trump (R)']] = df[['MoE', 'Biden (D)', 'Trump (R)']].apply(lambda x: pd.to_numeric(x))  # prints the following

There’s another nuance we should take care of here, and that’s the date format. Polls are difficult, costly things to produce, and difficult, costly things take time. Hence for each of these polls we don’t have a single date, but rather a date range. Since the desired output here is something that tracks polling over time, we need some way to turn those ranges into a single date index. I’m going to assign each poll to the middle of its range (think of this as taking the median of a date range). In other words, if we have a poll conducted from 4/1 to 4/3, I’d assign that poll a “date” of 4/2. This code will perform the necessary transformations:

df['start_date'] = [x.split('-')[0].strip() for x in df['Date']]
df['end_date'] = [x.split('-')[1].strip() for x in df['Date']]
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(lambda x: x + '/20')

# Convert to datetime
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])

df['poll_date'] = ((df['end_date'] - df['start_date']) / 2) + df['start_date']  # works because these are already Timestamp objects

Let’s try plotting that. We’ll start with a basic scatter plot of all the polling numbers.
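Here’s a minimal Bokeh sketch of that first pass (the column names follow the DataFrame built above; the figure size, colors, and other settings are just illustrative choices, not the exact code from the notebook):

from bokeh.io import output_notebook, show
from bokeh.plotting import figure

output_notebook()  # render plots inline in Jupyter

p = figure(x_axis_type='datetime', plot_width=800, plot_height=400,
           title='2020 National General Election Polls')
p.circle(df['poll_date'], df['Biden (D)'], color='navy', alpha=0.5, legend_label='Biden (D)')
p.circle(df['poll_date'], df['Trump (R)'], color='firebrick', alpha=0.5, legend_label='Trump (R)')
show(p)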

It’s pretty, but it’s basic. Let’s add some decoration so that our plot more accurately and precisely reflects what the polls are telling us.

Scatterplots can be tough to interpret on their own, without some indication of trend or tendency. There are a couple of ways to add one. I actually started with a simple rolling mean, for example, but that led to an ugly and uninformative chart (I should’ve read this article on how to chart polling data first). Instead, I’m going to create a LOESS (Locally Estimated Scatterplot Smoothing) line to track polling responses. As far as I know, there is no Bokeh equivalent to ggplot’s geom_smooth option, so I’ll build one myself. I’m following this guide, written by Allen Downey last year.

Building a LOESS line is relatively straightforward given the statsmodels library for Python. Downey’s code, which you can see in the notebook snapshot below, essentially uses the library as it’s designed to be used; the only additional prep step I took was to create a new DataFrame with the ‘poll_date’ column as the index.
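Here is a sketch of that approach, adapted from Downey’s make_lowess idea; the function name, the frac value, and the line styling are my own choices, and it adds the smoothed lines to the figure p from the earlier snippet:

import numpy as np
import pandas as pd
import statsmodels.api as sm

def make_lowess(series, frac=0.5):
    """LOESS-smooth a numeric Series indexed by dates."""
    endog = series.values
    exog = series.index.values.astype('int64')   # datetimes -> integer nanoseconds
    smooth = sm.nonparametric.lowess(endog, exog, frac=frac)
    index, values = np.transpose(smooth)
    return pd.Series(values, index=pd.to_datetime(index.astype('int64')))

polls = df.set_index('poll_date')                # the extra prep step mentioned above
biden_smooth = make_lowess(polls['Biden (D)'])
trump_smooth = make_lowess(polls['Trump (R)'])

p.line(biden_smooth.index, biden_smooth.values, color='navy', line_width=3)
p.line(trump_smooth.index, trump_smooth.values, color='firebrick', line_width=3)
show(p)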

Et voilà, we have some trends:

We could stop here; the image below is the same plot with some added window dressing.
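For what it’s worth, most of that window dressing is standard Bokeh styling, something along these lines (the specific labels and settings here are illustrative, not the original notebook’s):

p.yaxis.axis_label = 'Support (%)'
p.xaxis.axis_label = 'Poll date (midpoint of field period)'
p.legend.location = 'top_left'
p.legend.click_policy = 'hide'   # click a legend entry to hide/show its glyphs
p.title.text_font_size = '14pt'
show(p)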

Let’s calculate some confidence intervals for the “true” polling average of each candidate. In other words, we want to plot and represent our “best guess” for how each candidate is polling nationally. I’m cheating a bit here: I’m substituting the margin of error of each poll for a real confidence interval around the polling average.
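One rough way to draw that band in Bokeh is varea, shading each LOESS line by the average reported margin of error; again, this is an approximation rather than a true confidence interval, and the colors and alpha are my own choices:

moe = df['MoE'].mean()   # average reported margin of error across polls

p.varea(x=biden_smooth.index, y1=biden_smooth - moe, y2=biden_smooth + moe,
        fill_color='navy', fill_alpha=0.15)
p.varea(x=trump_smooth.index, y1=trump_smooth - moe, y2=trump_smooth + moe,
        fill_color='firebrick', fill_alpha=0.15)
show(p)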

This is the kind of plot you’re likely to see in The Economist (see the article I linked above), or even on FiveThirtyEight.

The FiveThirtyEight modeling engine is a souped-up version of what we’ve got here. Their model is, of course, more sophisticated, and actually issues predictions, while all this code does is generate a nice looking (and hopefully informative) plot. But what if we want to drill down on individual polls? How do pollster ratings coincide with what we can see in this chart?

I’ll explore those questions in the next article.

