Morbid Statistics with R

Alex
Published in Level Up Coding
11 min read · Apr 7, 2020


Moving on from the (relatively) lighter topic of getting to know and use some Google APIs, I wanted to do something focused on the actual analysis of some real-world data. That's where this week's dataset comes in: a list of police killings maintained by the Washington Post, and available here. I'll be looking at some interesting (interesting?) features of the data and conducting a few inferential tests with R. Some of the questions we'll want to answer are things like, "Is there a relationship between race and fatal police shootings?" (yes) and "Does the mean age of a fatal police shooting victim differ between races?" (also yes). Let's see what we can find.

My previous post was in Python, so I’ll be doing this exercise in R. The first step is to actually get our hands on the data:

library(data.table) # fread comes from data.table, not readr
df <- fread("https://raw.githubusercontent.com/washingtonpost/data-police-shootings/master/fatal-police-shootings-data.csv", data.table = FALSE) # data.table = FALSE returns a plain data.frame
str(df) # should print the following

So there's some cleanup we need to do. First, let's look for NAs or missing values. The easy way to do this is to apply the is.na function across the data frame. Although this shows no NAs, we should be careful to also look for values that are effectively missing but not coded as NA.
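A minimal version of that check, just counting NAs column by column with base R:

colSums(is.na(df)) # number of missing values in each column -- all zeros here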

One place we should definitely ensure we clean up is the race column, because that’ll be a primary column of interest. If we look at value counts with the table command, we’ll see there are a number of missing values that aren’t coded as NA:
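The check itself is a one-liner:

table(df$race) # blank ('') entries show up as a category of their own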

So we have some 617 shootings where the race of the victim is blank. Because we don't have a good way of dealing with these, we can discard these observations.

library(dplyr) # load dplyr (install.packages('dplyr') first if you don't have it)
df <- df %>% filter(race != '') # remove rows without a recorded race

So now our data has race coded for each observation and is something we can work with. We need to do the same thing with the gender column to ensure we have only “M” and “F” values.
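A sketch of that gender cleanup, mirroring the race filter above (check the table first, since the exact stray values may differ):

table(df$gender) # look for blank or otherwise unexpected values
df <- df %>% filter(gender %in% c('M', 'F')) # keep only observations coded M or F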

Next, we need to consider our variables. R is treating many of the variables here as strings ('chr' or character) when in reality they are factors, i.e. categorical variables. In this case, we'll convert id, manner_of_death, armed, gender, race, city, state, signs_of_mental_illness, threat_level, and flee to factors. body_camera we can keep as a logical variable, and we'll keep age as a numeric value. My code, which is ugly, does this one variable at a time. The better way to do mass conversions is with lapply:

cols <- c('id', 'manner_of_death', 'armed', 'gender', 'race', 'city', 'state', 'signs_of_mental_illness', 'threat_level', 'flee')
df[cols] <- lapply(df[cols], factor) # lapply applies factor() to each of these columns

So now we should have the following: each column should be of the appropriate data type, and we shouldn't have any extraneous levels in our factor variables:
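A quick way to verify is to re-run str and peek at the factor levels:

str(df) # each column should now be a factor, logical, or numeric as appropriate
levels(df$race) # should list only A, B, H, N, O, W -- no blank level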

Let’s get to work looking at the data then.

Let’s see if there’s anything to some of those questions posed at the head of the article. How about age breakdowns of police shooting victims?

library(lattice) 
xyplot(age ~ race, jitter.x = T, data = df) #prints the following

And what if we look at the same statistic but include gender?

xyplot(age ~ race | gender, jitter.x = T, data = df) #prints the following

To begin with, we can see that there are far more men than women involved in these shootings, but it also looks like there’s a pretty wide age range among the different races. Whites appear to have the largest spread, while Blacks and Hispanics tend to be somewhat more concentrated at younger ages.

Histograms will show the distribution more clearly. We shouldn’t expect too much spread here: summary(df$age) shows that the 3rd quartile of our data is at 45 years old, so most of our observations will skew somewhat younger. To illustrate the histogram approach (and to keep things from getting too crowded), I’ll plot the histogram with only White and Black victims.

library(ggplot2) # for plotting
plt <- df[df$race == 'B' | df$race == 'W',] %>%
  ggplot(aes(x = age, fill = race)) +
  geom_histogram(alpha = .6, position = 'identity') +
  labs(fill = '')
plt # prints the following:

Here we can see that there are actually more White victims than Black victims, but that ages of Black victims are skewed younger (although the youngest victim does appear to be White). Let’s go ahead and add in the mean and median for each group so we can get an exact feel for the difference.

We can also use a facet wrap to get six histograms at once (here we’ll use a grayscale for visibility, and because the default ggplot color palette is a little… bright for the data we have here):

plt <- df %>% ggplot( aes(x = age)) + 
geom_histogram( alpha = .9, position = 'identity') +
facet_wrap(~race)
plt # print to see the chart -- prints the following:

So again, it looks like we have a plurality of White victims, but Black victims skew younger; Black, White, and Hispanic victims make up the vast majority of our sample. We should add in some descriptors of central tendency. Let's add lines representing the mean and median age of victims in each racial group. The easiest way to do this is to add new columns to our original data frame holding each group's mean and median. With dplyr this is relatively straightforward:

df <- df %>% group_by(race) %>% mutate(med = median(age),
                                       mean = mean(age)) # adds median and mean columns for each group

We can now check the central tendency for each facet:

plt %+% df + # %+% swaps in the updated data frame, which now has the med and mean columns
  geom_vline(aes(xintercept = med)) +
  geom_vline(aes(xintercept = mean), linetype = 'dotdash')
# prints the following

So now we can see the median (solid line) and the mean (dot-dashed line) for each racial group.

Are any of the means significantly different from one another? We can compute some statistics to test for significant differences between the means. First, we check the actual numerical values of the mean age among each racial group:

df %>% group_by(race) %>% summarize(mean_age = mean(age))
#prints the following

Now let's see if the differences among means are statistically significant. We'll do two tests here: a t-test and a Tukey Honest Significant Difference test. The t-test assumes that the variable sampled (in our case age) follows a normal distribution, which, pace demographers, is probably a reasonable assumption here (i.e. we can treat the ages of American adults as roughly normally distributed), and, in its classic form, that the variances of the two samples are equal. I haven't computed the sample variances, but R lets us set a parameter (var.equal) indicating whether to assume equal variances. In this case we won't make that assumption, so we'll be performing a version of Welch's t-test instead of the more familiar Student's t-test; this is actually t.test's default behavior. Since we're limited in this case to comparing two means, let's compare Black and White victims.

b_and_w <- df[df$race == 'B' | df$race == 'W',]
b_and_w$race <- droplevels(b_and_w$race) # can only have 2 levels
mod1 <- t.test(b_and_w$age ~ b_and_w$race)
mod1
#prints the following

We can see that there is in fact a significant difference between the ages of Black and White victims; Black victims tend to be younger by ~7 or so years, which is significant at the 95% level.
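For the exact figures behind that difference, the object returned by t.test stores both the group means and the confidence interval; a quick way to pull them out:

mod1$estimate # mean age in each group
mod1$conf.int # 95% confidence interval for the difference in mean ages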

Instead of constructing t-tests for every pair of means (with six factor levels, we'd have to build 15 different tests, though R does make this somewhat easy with the pairwise.t.test function, which computes a test for every pair of groups and adjusts the p-values for the multiple comparisons; a sketch follows below), we can generate an ANOVA table and use the Tukey HSD test to check the difference in mean age between every pair of racial groups. Typically we use these kinds of tests in designed experiments such as so-called "randomized block" designs, but the approach works here as well.
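For reference, that pairwise approach would look something like the sketch below (the Bonferroni correction is just one option; pairwise.t.test supports several p-value adjustment methods):

pairwise.t.test(df$age, df$race, p.adjust.method = 'bonferroni') # prints an adjusted p-value for every pair of racial groups

Back to the main approach: here's how to build the ANOVA and run Tukey's test: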

mod2 <- aov(age ~ race, data = df)
tukey <- TukeyHSD(mod2)
tukey # prints the following

To see the effects graphically, we can call plot(tukey, las = 1). The las parameter changes the orientation of the labels for readability — see DataCamp’s documentation. What we get from the plot is this figure:

Here we can see a number of pairs whose intervals don't include 0, i.e. whose mean ages differ significantly at the 95% confidence level. Those pairs are B-A, W-B, W-H, W-N, W-O, most of which involve White victims. (Note too that the differences from 0 are largest for the W-B and W-H pairs.)
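If reading values off the plot gets tedious, the same information can be pulled directly out of the TukeyHSD object; a small sketch:

tukey_race <- as.data.frame(tukey$race) # one row per pair, with columns diff, lwr, upr, and p adj
tukey_race[tukey_race$`p adj` < .05, ] # keep only the pairs that differ significantly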

So we've determined that there is a statistically significant difference in the ages of some of these victim populations. What about the overall distribution of races in the sample? Because we're looking at categorical data, the appropriate tool is a chi-square test, specifically a goodness-of-fit test. What we're testing here is the (null) hypothesis that fatal police shootings are independent of race, i.e. that the races are distributed in the sample of shooting victims as we'd expect them to be based on the racial distribution of the population as a whole.

This article is a great introduction to chi-squares, which are a very common distribution found in all sorts of situations — you may be familiar with chi-squares from an introductory bio or genetics class (that’s where I encountered them first anyway) where genetic results are compared with expected distributions (Punnett squares anyone?). But there are other situations, like racial representation on juries or criminal justice where they prove useful as well.

First, what does the chi-square distribution look like? We can generate a curve in R with, naturally, the curve function. There are six races in our sample, so we know we'll have 5 degrees of freedom. Let's plot the curve for a chi-square distribution with 5 degrees of freedom; we'll shade the area corresponding to critical values of the test statistic as well (i.e. the area in which test statistics allow us to reject the null hypothesis).

curve(dchisq(x, df = 5), from = 0, to = 50) # print the basic figure
## add shading for the critical values at 95%
upper <- qchisq(.975, 5)
upper_95 <- seq(upper, 50)
p_upper_95 <- dchisq(upper_95, df = 5)
curve(dchisq(x, df = 5), from = 0, to = 50)
polygon(c(upper_95, rev(upper_95)), c(p_upper_95, rep(0, length(p_upper_95))),
        col = adjustcolor('red', alpha = .3), border = NA)
## note: adapted from https://www.statology.org/how-to-easily-plot-a-chi-square-distribution-in-r/

Note that the critical value for a chi-square with 5 degrees of freedom is given by qchisq(.975, 5) = 12.8325, so we know that if we run the test on our data and find a test statistic greater than 12.8325 we can reject the null hypothesis and conclude that the observed racial proportions in our sample do not match the expected distribution.

Typically when we look at chi-square tests we are looking at 2x2 contingency tables; this article gives a good example looking at political party identification and views on a particular tax policy. Here, though, we have only our observed counts. We'll need to build expected counts by looking at the demographics of the country. Luckily all we need here are the percentages: R is smart enough to build expected counts as long as we pass a vector containing the expected percentages in with our observed counts, so we can save ourselves some work and head on over to the Wikipedia article on the demographics of the United States to get our expected racial breakdown. For simplicity, I'm going to assume that each victim was of one and only one race, and I'll use the proportions given by the 2013–2017 ACS (American Community Survey, a common data source for sociological and political science research). The head of the table from Wikipedia/ACS is below:

So there we have it; we just have to add in our expected proportions and compute the test statistic. R will throw an error if our percentages don’t add up to 1, but that’s ok because we can rescale whatever proportions we pass in. To generate the statistic then, we’ll first make a table of our data by race:
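A minimal sketch of that step, using table for the counts and prop.table for the corresponding proportions (so we can line them up against the ACS figures):

tbl <- table(df$race) # observed counts of victims by race
prop.table(tbl) # observed proportions by race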

Comparing the observed proportions with the ACS figures, we can see that they differ substantially from what we'd expect. We now create a vector to store the expected proportions:

props <- c(.0504, .127, .176, .008, .048, .615) # expected proportions, in the same (alphabetical) order as the levels of df$race
# won't sum to exactly 1 because of rounding, identification with two or more races, etc.

Finally, we compute the test statistic for our data with

chi <- chisq.test(tbl, p = props, rescale.p = T)
chi

And, as expected, our statistic is large enough to reject the null hypothesis:
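If you'd rather not read the numbers off the printed output, the key pieces can be pulled straight from the test object:

chi$statistic # the test statistic -- far above the 12.83 critical value noted earlier
chi$p.value # the associated p-value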

There’s… not really a lot of doubt about the significance of the results given a test statistic this large. Based on this sample, it’s clear that the racial breakdown observed in fatal police shootings is not independent of race, and that the observed distribution deviates significantly from the expected distribution given the demographics of the United States as a whole.

There are a number of ways to extend the analysis of this dataset. We could look at body cameras, whether victims were armed (and if so, with what), and geography. But I'll save further investigation for future posts. I'll close by noting that, as news of coronavirus infections and deaths continues to come in from places like Detroit and New Orleans with large minority populations, we're seeing other scenarios where analysis like this can be used. Minorities and the poor are suffering disproportionately from the coronavirus in these areas; a few chi-squares are all it takes to show that this is not a virus that is proportional in its death rate.

