The Shifting Definition of Newsworthiness

Mapping the New York Times’ Datelines Since 2015

Alex
The Startup

I read Bari Weiss’ resignation letter from the New York Times with some perplexity. In particular, I found her claim that she “was hired with the goal of bringing in voices that would not otherwise appear in your pages” a bit strange: Weiss is, after all, a wealthy graduate of Columbia who’s lived in the Northeast her entire life.

I’ve been playing around with New York Times archive data for some time and wanted to test Weiss’ claim. Has there been a (quantitative) change in the news the New York Times produces, especially since November 2016?

In this post, I’m going to take a look at what we can learn about “all the news that’s fit to print” from the metadata available in the archive — things like word counts, bylines, and news divisions. I’m looking for ways to approximate the “diversity” of the Times’ coverage since 2015. Essentially, we’re looking for real evidence of all those reporters sent to diners in Trump country in 2016 and after.

(For reference, here’s a map of all the IHOPs in the country. IHOP seemed like a good proxy for diners generally, and Waffle House is too concentrated in the south to be meaningful nationwide — it’s actually not that far off from what we have.)

The New York Times, as the largest and most successful American newspaper (and, as far as I can tell, the biggest one with a publicly available API), offers an interesting case for analysis. The Times went to a paywall in 2011, and its emphasis on producing content to fuel subscriptions has accelerated since then; simultaneously, engagement with the Times has increased steadily since Trump’s election. Bari Weiss aside, how has the Times navigated its way through competing currents in media?

Fit to Print

Let’s first look at the Times’ output in broad terms. The Times itself reported a 42% increase in the number of paid (digital) subscriptions between December 2016 and December 2017. Not all of that is attributable to Trump (we can gesture at secular media trends and the effectiveness of the Times’ own strategy there), but some of it certainly is: the Times told CNBC in November 2016 that it had seen startling subscriber growth in the three weeks since the election, and the Times rode that increase to the 42% year-over-year subscriber growth cited in its 2017 10-K.

Technical Notes

I’ve used the Times’ archive API to download all the available data, going back to January of 2015. I wrote a class to handle this for me, but the script is relatively simple: call the API, parse the JSON, save as a CSV, and repeat for each year, for each month. Because we get a full month’s archive with each API call, we should fall well under the rate limit of 4,000 calls per day.
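The loop described above could be sketched roughly as follows. This is a minimal illustration, not the class from the repo: the helper names (`archive_url`, `flatten_docs`, `to_csv`) are hypothetical, and the field names (`response.docs`, `headline.main`, `pub_date`, `section_name`, `news_desk`, `word_count`, `web_url`) reflect my understanding of the Archive API's JSON and should be checked against the current documentation.

```python
import csv
import io

# URL pattern for one month of the archive (assumed; verify against the API docs).
ARCHIVE_URL = "https://api.nytimes.com/svc/archive/v1/{year}/{month}.json?api-key={key}"

# Columns kept in the CSV; a subset chosen for illustration.
FIELDS = ["pub_date", "headline", "section_name", "news_desk", "word_count", "web_url"]


def archive_url(year, month, key):
    """Build the request URL for one month of the archive."""
    return ARCHIVE_URL.format(year=year, month=month, key=key)


def flatten_docs(payload):
    """Turn one month's parsed JSON into a list of flat dicts, one per article."""
    rows = []
    for doc in payload.get("response", {}).get("docs", []):
        rows.append({
            "pub_date": doc.get("pub_date", ""),
            # The headline is a nested object; keep only the main headline text.
            "headline": (doc.get("headline") or {}).get("main", ""),
            "section_name": doc.get("section_name", ""),
            "news_desk": doc.get("news_desk", ""),
            "word_count": doc.get("word_count", 0),
            "web_url": doc.get("web_url", ""),
        })
    return rows


def to_csv(rows):
    """Serialize flattened rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

In practice each month would be fetched with something like `json.load(urllib.request.urlopen(archive_url(2016, 11, key)))`, passed through `flatten_docs`, and written to disk, with a short pause between calls to stay under the rate limit.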

The code I used to extract the archive is available on my GitHub, here. The full script, an exercise in data engineering, will also let you post the archive to a relational database (Redshift on AWS in this case), but that’s beyond the scope of this post.

We can see that the overall output (in terms of stories published) has been decreasing over time:

Trump’s election does not appear to have had any effect here — the Times has steadily dropped the number of articles it’s published going back to 2015.

Is this true across all news desks at the paper? Or across all sections (i.e. politics, U.S. news, international news, etc.)?

Nothing jumps out from the chart; rather, there’s a general decline across different sections of the paper. In terms of real numbers, the last 10 months of aggregate data for the top sections of the paper (i.e. the sections under which the most articles are published) look like this. The full table is available in the GitHub repo; throw out the last row as potentially incomplete:

But perhaps there’s more to the story than declining “output.” If we look not at the number of articles, but at the number of words devoted to each section, we can see where that output has gone. I’ve tallied the average word count by year in the following table:

So there are fewer articles being published, but those that make it to print (or the website) tend to be longer — growing at an average of about 7.5% every year.
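For anyone who wants to reproduce the average-length figure, here is one way the calculation could go. It is a sketch under assumptions: `rows` is a list of dicts with an ISO-formatted `pub_date` and a numeric `word_count` (as in the flattened archive), and both function names are made up for illustration.

```python
from collections import defaultdict


def mean_words_by_year(rows):
    """Average word count per article, keyed by publication year."""
    totals = defaultdict(lambda: [0, 0])  # year -> [total words, article count]
    for row in rows:
        year = int(row["pub_date"][:4])  # assumes dates start with YYYY
        totals[year][0] += int(row["word_count"])
        totals[year][1] += 1
    return {year: words / count for year, (words, count) in sorted(totals.items())}


def yoy_changes(by_year):
    """Fractional change from each year to the next, keyed by the later year."""
    years = sorted(by_year)
    return {
        later: by_year[later] / by_year[earlier] - 1
        for earlier, later in zip(years, years[1:])
    }
```

Averaging the values returned by `yoy_changes` gives a single "average annual growth" number of the kind quoted above.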

The table below shows the top sections in every year since 2015, as well as the word count for all the articles published in each section. I’ve included the year over year change as well. (I should note that for 2020 articles I’ve prorated the output, so that although only 7 months of 2020 are fully available for analysis, we should be able to compare 2020 rates with other years available in the dataset.)
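The proration itself is simple linear scaling. A minimal sketch, with a hypothetical helper name, and with the caveat that it assumes output is spread evenly over the year (which an election year may well violate):

```python
def prorate(observed, months_available, months_in_year=12):
    """Scale a partial-year total up to a full-year estimate."""
    if not 0 < months_available <= months_in_year:
        raise ValueError("months_available must be between 1 and months_in_year")
    return observed * months_in_year / months_available
```

With 7 months of data, a partial-year count is multiplied by 12/7, or roughly 1.71.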

We can look at this data graphically, too:

And using the same analysis we did for the articles above:

Let me count the words

Some things stand out: as we’ve seen, it looks like the decline in the paper’s “output” is general, rather than particular. Most of the sections are down, albeit modestly. In all, the Times publishes about 45.5 million words each year. This has fallen slightly since 2015, by about 2.5% each year.

What we see is generally in line with our previous findings: the Times is publishing fewer articles and slightly fewer words overall, but the articles themselves are getting longer, so the two trends partly offset each other. The Times is producing less content, but more detailed content, as part of a secular shift in strategy from a business based on advertising to one based on digital subscriptions.

With these longer, more detailed pieces, the Times might have more room to send journalists to Trump country diners. Do we see that in their coverage?

The Politics of Intolerance

So, what was the Times covering in 2015? What was it covering then that it isn’t now, and what is it covering now that it wasn’t then?

We can look at this in a few ways. In this piece I’m going to look at the geography of coverage (at least, what I could find from some simple text mining), but some others include text analysis of both headlines and keywords, or looking at bylines and representation at the Times in more detail.

Technical notes

I’ve generated the data under analysis by parsing datelines in archived articles. This is, as far as I can tell, the easiest way to get geography from historical articles. The Times does offer a wire service that includes geographical information, but what I’m interested in here is historical data, not incoming stories.

The script I use to parse geography relies on the journalistic convention of the dateline — something like KABUL, Afghanistan; LONDON; or, as in one recent article, CENTER OF THE WORLD, Ohio (in an article about Ohio State football). Given the available data, this seemed like the best (/only) way to get the information we’re interested in.
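To make the convention concrete, here is a rough sketch of dateline extraction with a regular expression. It is not the script from the repo. It assumes the dateline opens the article's lead text, that the place name is in capitals with an optional mixed-case region after a comma, and that a dash separates it from the story; `DATELINE_RE` and `parse_dateline` are names invented for this example.

```python
import re

DATELINE_RE = re.compile(
    r"^(?P<city>[A-Z][A-Z.'\- ]*[A-Z.])"        # all-caps place name, e.g. KABUL
    r"(?:,\s+(?P<region>[A-Z][A-Za-z. ]*?))?"   # optional region, e.g. Afghanistan
    r"\s*[\u2014\u2013-]\s"                     # em dash, en dash, or hyphen, then a space
)


def parse_dateline(text):
    """Return (city, region) if the text opens with a dateline, else None.

    region is None when the dateline names only a city (e.g. LONDON).
    """
    match = DATELINE_RE.match(text.strip())
    if match is None:
        return None
    return match.group("city"), match.group("region")
```

A pattern this simple will miss unusual datelines and can be fooled by text that merely opens with capitals and a dash, so the results are worth spot-checking by hand.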

A few caveats to this approach: articles that have a dateline like this are less and less common, not only at the Times but elsewhere in the news media as well. In many cases, we also miss out on geographic information from the dateline because there’s no geographic focus to the article — think opinion pieces, magazine posts, podcasts and other multimedia posts, as well as “geographically diffuse” stories about things like the coronavirus, or Biden’s VP pick.

More specifically, if we look at the number of articles with extractable geographic information by year, we have the following:

Not surprising, given that we know the number of stories overall is decreasing. But if we look at these kinds of stories as a percentage of all stories published, we get a clearer picture:

It looks like the percentage of stories with identifiable geography in the dateline has remained roughly constant, with a dip in 2018.

This suggests a few things. First, there may have been a shift to more named places after Trump’s election (2017 has the highest percentage of “geographic” datelines, at 28.5%), but it wasn’t major (across all years, an average of 26% contained geographic information). Second, though we’ve seen the Times drop the overall number of articles, it doesn’t seem to have shifted significantly away from “locatable” stories.
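The percentages above come down to a simple ratio per year. A sketch of that calculation, assuming each article has already been reduced to a `(year, place)` pair where `place` is `None` when no geography could be extracted (the function name is hypothetical):

```python
from collections import Counter


def geographic_share_by_year(articles):
    """Fraction of each year's articles that carry an extractable place."""
    total = Counter()
    located = Counter()
    for year, place in articles:
        total[year] += 1
        if place is not None:
            located[year] += 1
    return {year: located[year] / total[year] for year in sorted(total)}
```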

There’s one more thing we should look at with these aggregate measures: which sections are the most “diverse?” By “diversity” here I mean place diversity, without reference to the subjects or authors of the pieces in question (that material should be saved for a future analysis). I’m going to exclude foreign desk coverage here—it’s natural that a desk with a remit to cover stories from all over the world would feature articles from… all over the world—and focus on domestic stories. So, what does the diversity by news desk look like?

And finally, what is the count of unique places by year?
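Both of these questions reduce to counting distinct places within a group. Here is a sketch, assuming each article is a dict with `news_desk`, `year`, and `place` fields (the last derived from the dateline), and assuming the foreign desk is labeled `"Foreign"` in the data; the function name and the field names are illustrative only.

```python
from collections import defaultdict


def unique_places(articles, key, exclude_desks=("Foreign",)):
    """Count distinct places per group (e.g. key="news_desk" or key="year")."""
    places = defaultdict(set)
    for article in articles:
        # Skip excluded desks and articles with no extractable place.
        if article["news_desk"] in exclude_desks or not article["place"]:
            continue
        places[article[key]].add(article["place"])
    return {group: len(found) for group, found in places.items()}
```

Passing `exclude_desks=()` keeps foreign coverage in the count, which may be what you want for the yearly totals.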

(Note that the 2020 number is low because only seven months of the year are available. On a pro-rated basis, 451 places would translate to about 770 locations, putting 2020 in line with 2019.) It would seem that Weiss was wrong about the Times’ direction (or at least, her hire didn’t exactly presage a new era of geographic diversity in its pages).

On to the mapping. First, let’s look worldwide.

A first quick look at the distribution of (unique) dateline geography looks ecumenical enough. The distribution is, naturally enough, concentrated in the United States (although it looks like maybe there is no news made in Idaho or eastern New Mexico), Europe, and the Pacific Rim. (Note that this distribution is roughly the same one you might find from any other big news organization: this chart from a 2018 Forbes article shows the distribution of stories on CNN, Fox, and MSNBC.)

What happens if we zoom in on the United States, and look at year over year changes? For this part, I’ve filtered the data down to include the United States only. Let’s see what we have.

Some things stand out here: we have lots of coverage in the big cities — New York, DC, Los Angeles, Chicago (though we might expect more news about places closer to New York — commuters from Connecticut would be interested in the Metro section, for example) but five red states, Ohio, Iowa, Pennsylvania, Georgia and Florida, get a lot of coverage as well.

What about on a per-capita basis? Here, I’m using 2019 state population estimates from Wikipedia.
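The per-capita adjustment is just a division by population. A minimal sketch, assuming two dicts keyed by state name, one holding article counts and one holding population estimates; the function name and the "per 100,000 residents" scale are choices made for this example.

```python
def per_capita(counts, population, per=100_000):
    """Articles per `per` residents, for states present in both inputs."""
    return {
        state: counts[state] * per / population[state]
        for state in counts
        if state in population
    }
```

States missing from the population table are silently dropped here, so it is worth comparing the keys of the two inputs before mapping.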

I left D.C. out because it was such an outlier it threw the rest of the map off. This is what we’d expect from political coverage especially—Iowa and New Hampshire get tons of coverage in presidential election years, enough to make them stand out across the entire sample.

One more hex map. What if we exclude political coverage? Here I’m filtering out all stories from the politics news desk.

Generally pretty similar: we see somewhat more emphasis on the Northeast, but still a lot of coverage of California and, strangely, New Hampshire.

For a closer look at the impact of Trump’s election, we can look at the data from 2015–2018. Are there any changes we can see in the map over time?

We can see a lot more action in some of the swing states in election years. 2018 and 2016 in particular show a broader range of states covered in the Times’ pages.

Further considerations

Next we should look at what the Times is writing about in each of these places. In other words, what are the keywords most associated with each state?
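One way to get at this is to tally keyword tags per state. The sketch below assumes each article carries a `state` field (derived from its dateline, which is my own addition here) and a `keywords` list of dicts with a `value` entry, which is how I understand the archive's keyword metadata to be structured; the function name is hypothetical.

```python
from collections import Counter, defaultdict


def top_keywords_by_state(articles, n=10):
    """Return the n most frequent keyword values for each state."""
    counts = defaultdict(Counter)
    for article in articles:
        state = article.get("state")
        if not state:
            continue  # no usable geography for this article
        for keyword in article.get("keywords", []):
            counts[state][keyword["value"]] += 1
    return {state: counter.most_common(n) for state, counter in counts.items()}
```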

Let’s start with Washington, D.C., which sees the most Times coverage.

As we’d expect, politics, and in particular Trump, dominate Washington coverage. It’s also not surprising that the Republicans get more coverage than Democrats, since they control the Senate, White House, and Supreme Court (all of which are also featured prominently).

And what if we look at “diner country,” that is, states in the middle of the country where we’d already identified lots of Times coverage? Here’s what we have for Ohio:

And Iowa:

And finally, Pennsylvania:

It certainly does look like the Times invested heavily in coverage of 2016 swing states and covered the aftermath of the election. Curiously though, the Sports section also appears to be a major driver of “place diversity” in the Times’ coverage.

Summing up

So, have we found any evidence of diner journalism in the Times’ pages? Well, sure, some. It is true that, especially in election years, swing states in “flyover country” get more attention. In general, the Times looks more or less like any other big news organization, focusing on the major cities in the United States, especially in the Northeast. If you live in New Hampshire, you have the good fortune of having the most words per capita written about your state by the Times.

There may be some truth to what Weiss wrote in her resignation letter; the Times does, especially in election years, spread its coverage out to states across the country. We’d expect nothing less of a national news organization. However, we’ve also seen that fewer and fewer unique places are getting covered by the Times, although the articles being produced now are longer than in years past.

In all, the New York Times really does produce news from all over the globe and all over the United States. I’d add that the Times has also committed itself to a diversity of voices on its op-ed page, and had done so before it hired Bari Weiss: the Times has long had (dishonest) conservatives on its payroll, elevating voices that would otherwise be heard in, say, the pages of the New York Post. (The Times also proclaims that it is committed to publishing a diversity of letters to the editor in its op-ed pages.)

I haven’t discussed another important aspect of diversity at the New York Times: the identities of the people producing and gatekeeping the stories that get written. This is, I would argue, just as important as any focus on geographic representation, and requires a deeper treatment than whatever we can get at from individual lines in individual stories. Weiss is right that the news business has the power to elevate people’s stories and voices; the Times may well be committed to doing so in its coverage. We should encourage it to do the same in the boardroom.

Delivering the finest gymnosophistry west of the Indus. An occasional blog about projects I’ve undertaken, usually focusing on data and analytics.