What to Read Next? Analyzing the Digital Fragmenta Historicorum Graecorum with Python and NLP

Alex · Published in The Startup · May 21, 2020

The Library of Alexandria

Since the onset of COVID-19 and social distancing guidelines, I, like many others, have been looking for ways to assuage anxiety. I've managed to find some comfort (and edification) in the Greek and Latin classics. Now is as good a place as any in this story to offer Thomas Jefferson's evergreen and, in some circles, famous quote:

[T]o read the Latin & Greek authors in their original is a sublime luxury … I enjoy Homer in his own language infinitely beyond Pope’s translation of him, & both beyond the dull narrative of the same events by Dares Phrygius, & it is an innocent enjoyment. I thank on my knees him who directed my early education for having put into my possession this rich source of delight: and I would not exchange it for any thing which I could then have acquired & have not since acquired.

In a previous life I was a classicist. I studied the ancient novel, and wrote a thesis about Heliodorus (which I believe is the only thesis in Washington University’s Open Scholarship library tagged with “Byzantine Literature”). When I was in college and later graduate school, I relied in no small part on digital versions of classical literature, and the tools built on top of those digital versions, like Perseus and the TLG. However, that was the real limit of my experience with work in the digital humanities — until, of course, restlessness in the time of Coronavirus led me to a new discovery.

The DFHG (Digital Fragmenta Historicorum Graecorum, or Digitized Fragments of the Greek Historical Writers; perhaps more accurately, Die digitalen Fragmente der griechischen historischen Schrift), hosted by the University of Leipzig and created by Monica Berti and Gianluca Cumani, is a remarkable achievement. It's a digitized version of fragmentary Greek historical writing, mostly by authors who aren't typically taught in a classics curriculum.

Many people know the hits (everyone's heard of Thucydides), but the DFHG is the place to go for deep cuts. Maybe you're reading Gibbon and come across an anecdote like this one:

[t]he historian Priscus, whose embassy is a source of curious instruction, was accosted in the camp of Attila by a stranger, who saluted him in the Greek language, but whose dress and figure displayed the appearance of a wealthy Scythian. In the siege of Viminiacum he had lost, according to his own account, his fortune and liberty: he became the slave of Onegesius but his faithful services against the Romans and the Acatzires had gradually raised him to the rank of the native Huns, to whom he was attached by the domestic pledges of a new wife and several children. The spoils of war had restored and improved his private property; he was admitted to the table of his former lord and the apostate Greek blessed the hour of his captivity, since it had been the introduction to a happy and independent state, which he held by the honourable tenure of military service. This reflection naturally produced a dispute on the advantages and defects of the Roman government, which was severely arraigned by the apostate, and defended by Priscus in a prolix and feeble declamation.

But wait… who’s Priscus? Where can I read this “source of curious instruction,” or find his “prolix and feeble declamation” on the Roman system of government? The DFHG is the answer.

The DFHG also comes with a set of tools, including an API (application programming interface), that make querying the corpus relatively easy (and it's free). The search tools are limited, but there is a wealth of information available. That in turn brings me, finally, to the topic of this blog and a project: what can analysis of the DFHG tell us about the lexical complexity of its authors?

Why is this significant? If I were going to choose something to read — or teach — I’d want to make sure I spent my time reading and not checking vocabulary in the dictionary.

For this blog, I’m going to extract textual data from the DFHG, examine it with an NLP engine designed specifically for use with classical languages, and use the results to choose something to read. Let’s see what we find.

Getting the data

The DFHG's API is organized around queries by author. This makes sense given the fragmentary nature of the corpus: we don't have specific "works" to search through, like The Republic, so the author is the most atomic unit we can get. We thus need to extract a list of all the authors included in the corpus. We can then build out API calls and feed our data into an NLP engine.

This code outlines the steps I took to send a request to the DFHG’s API documentation, clean up the HTML, and get a (clean) list of authors.
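Something along these lines gets that author list. This is a minimal sketch: the documentation URL and the assumption that author names sit in list items are mine, so adjust to the page's actual markup.

import requests
from bs4 import BeautifulSoup

# Hypothetical URL for the DFHG API documentation page; substitute the real one
DOCS_URL = "https://www.dfhg-project.org/dfhg/api.html"

response = requests.get(DOCS_URL)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# Assumption: author names appear as the text of list items on the docs page
authors = [li.get_text(strip=True) for li in soup.find_all("li")]
# Drop empty strings and duplicates while preserving order
authors = list(dict.fromkeys(a for a in authors if a))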

Here is a quick check of the word counts from the fragments in our corpus (ignore the warning):
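A pandas check along these lines works here, assuming the fragments pulled from the API have already been collected into a data frame author_df with one row per fragment and a 'text' column (the same frame used in the regex step below):

# Rough per-fragment word counts: split each text on whitespace and count the tokens
author_df['word_count'] = author_df['text'].str.split().str.len()
author_df['word_count'].describe()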

Latium est, non legitur

This is not a clean dataset — at least, not in the sense that we're ready to do any actual analytics yet. The problem here is the nature of the material: these are fragmentary pieces of writing, some of which survive only through quotations in other sources. What makes life even more difficult is disentangling the "actual" authorial content from whatever may have accrued to the text at the hands of scribes across the centuries. More to the point, several of the fragments we have here are actually in Latin.

The DFHG does provide a Latin translation for most (if not all) of the fragments in the corpus, but when Latin comments turn up in the ‘text’ field of the data we scraped earlier, that suggests the influence of a commentator, not necessarily the author we want to look at. To proceed, we need to figure out a way to separate the content we want from the commentary we don’t. Luckily for us, Latin and Greek are different languages written in different alphabets. (It would be a different question entirely if we were trying to segregate, say, Spanish and English words when they’re bunched up together, since they use the same alphabet.)

Since Latin uses, naturally, the Roman alphabet, we can use regex (regular expressions — basically word searches) to find patterns of text in the data that are made up of Latin characters. I’m going to be pretty broad here: we want to capture as much of the Latin as possible while leaving Greek. If a passage is all “original,” we should let it stand. Otherwise, we should flag it. This line will pick up any row in our data that has a Latin character in the “text” field:

latin_index = author_df['text'].str.contains('[a-zA-Z]', regex=True)

Let’s do some further cleaning to get the number of Latin words, their positions within the text, and the relative frequency of Latin/non-Latin words in each fragment:
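Here is a sketch of that step, writing the results into the latin_word_index and latin_word_count columns used by the cleaning code further down (the exact implementation is my assumption):

import re
import pandas as pd

LATIN_RE = re.compile(r'[a-zA-Z]')

def latin_stats(text):
    # Positions of Latin tokens, their count, and their share of all tokens in the fragment
    words = text.split()
    latin_idx = [i for i, word in enumerate(words) if LATIN_RE.search(word)]
    freq = len(latin_idx) / len(words) if words else 0
    return latin_idx, len(latin_idx), freq

stats = author_df['text'].apply(latin_stats)
stats_df = pd.DataFrame(stats.tolist(),
                        columns=['latin_word_index', 'latin_word_count', 'latin_frequency'],
                        index=author_df.index)
author_df = author_df.join(stats_df)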

We now have counts and frequencies. At this point we have to make some decisions about how to handle these data points: we can just drop these rows and move on, or we can try and keep the texts that have a minimal amount of Latin in them.

Let's try the second route. There are options here as well: we can either drop rows entirely, or try and filter out the Latin words from each text. I'm going to try a combination of both, going through the data first to drop rows that are entirely or almost entirely Latin, then scrubbing the remaining text.

First, let’s look at the distribution of Latin in the corpus we have.
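The original charts aren't reproduced here, but a couple of histograms along these lines show the same distributions (the panel layout and contents are my assumption):

import matplotlib.pyplot as plt

fig, (top, bottom) = plt.subplots(2, 1, figsize=(8, 8))
author_df['latin_word_count'].hist(bins=50, ax=top)
top.set_title('Latin words per fragment')
# Share of Greek (non-Latin) words; fragments that are almost entirely Latin sit at the far left
(1 - author_df['latin_frequency']).hist(bins=50, ax=bottom)
bottom.set_title('Share of Greek words per fragment')
plt.tight_layout()
plt.show()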

What these charts tell me is that our intuition was right — we have a few authors whose work preserved in the DFHG is almost entirely in Latin (i.e. those fragments that fall on the left hand side of the lower figure), and many others where a few stray Latin words have made their way into the mainly Greek text.

To make a final pass at cleaning the corpus, I defined a few functions to take a list of words and return the indices of all Latin words in the list. I then filter the list of words by deleting elements at those indices.

for i, row in final_df.iterrows():
    # Delete Latin words from the end of each list so earlier indices stay valid
    indices = row['latin_word_index']
    for idx in sorted(indices, reverse=True):
        del row['words'][idx]

I then drop fragments with remaining Latin words.

final_df = final_df.drop(final_df[final_df['latin_word_count'] != 0].index)

NLP Time

Now we can move on to the NLP part of this project. I'm taking my cues from some of the example notebooks posted on the CLTK Github page. I'm going to calculate a few summary metrics, including lexical density, which measures the number of "lexical" words as a percentage of total words in a passage. A "lexical" word is one that is not extremely common; the extremely common ones are "stop" words (think something like "and" or "or" in English), and we don't want to include them in our analysis. Counting stop words doesn't tell us much about a text, beyond an author's penchant for conjunctions.

So, when we look at a text like the quote from Jefferson, above, we expect the lexical density to be somewhere in the middle, since there are a decent number of stop words, but few words are repeated. According to Analyze My Writing, the lexical density of the quote is 48.24%, which is about what we'd expect.

We can take a first look at lexical density by fragment after some cleanup. We'll then look at the lexical density by author. First, we need to regularize the texts and strip out punctuation and so-called stop words. (A special note on punctuation: Greek uses the semicolon ";" to indicate a question, so we won't include that character in the punctuation list to filter.) A stop word in this case is a common word that doesn't add much to our understanding of the meaning of the text. In English these are words like "the," "is," "and," and so on. The CLTK comes with a built-in list of stop words for both Greek and Latin, so I'll use that functionality to clean up the fragments. This code will clean up the list of words we've extracted from each fragment:

def filter_punct(words):
    # Filters out both punctuation and stop words; punct is the punctuation list
    # (minus the Greek question mark ';') and stops is the CLTK's Greek stop-word list
    out = [word for word in words if word not in punct]
    out = [word for word in out if word not in stops]
    return out

final_df['clean_words'] = final_df['words'].apply(filter_punct)
final_df['clean_lemmata'] = final_df['lemmata'].apply(filter_punct)
final_df[['author', 'clean_lemmata']] # prints the following

Now a quick check of the length of each fragment we have:
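A quick sketch of that check, reusing the columns created above (word_count here counts the cleaned tokens):

# Length of each fragment after dropping punctuation and stop words
final_df['word_count'] = final_df['clean_words'].apply(len)
final_df['word_count'].describe()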

And finally we can now calculate the fragment level lexical density for our corpus:

def lemmatize(text):
    return lemmatizer.lemmatize(text) # the CLTK lemmatizer, instantiated earlier

final_df['lemmata'] = final_df['clean_words'].apply(lemmatize)
# A Python set keeps only unique elements, so this is unique lemmata / total lemmata
final_df['lexical_density'] = final_df.apply(
    lambda x: len(set(x['lemmata'])) / len(x['lemmata']), axis=1)

Let's remove the fragments whose lexical density is 1 and plot to see what kind of distribution we have. (I'm omitting fragments where density is 1 for two reasons. First, these fragments tend to be much shorter: the average word count for a 100% density fragment is ~10 words, while those with lower densities average ~82 words. Since these are fragments, it's likely that passages with 100% density are not representative of an author's voice. Second, shorter fragments will almost by definition have higher lexical densities than longer fragments. Because shorter fragments are made up of fewer words, there are fewer opportunities for an author to repeat herself, and fewer repetitions lead to a "denser" text.)
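The filter and plot might look like this (a sketch, reusing the columns computed above):

import matplotlib.pyplot as plt

# Keep only fragments with at least some repetition, then plot the distribution
plot_df = final_df[final_df['lexical_density'] < 1]
plot_df['lexical_density'].hist(bins=40)
plt.show()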

Looks like the distribution is approximately normal, but with lots of left skew. We can check the descriptive statistics for more:

Combinatio Nova

So now we have some things to say about individual fragments, but fragments are, well, fragmentary. Can we say anything about the styles of a particular author? To do so, we’ll need to group our passages by author.

This is straightforward in Python because of the way lists behave. If we have two lists in Python, their sum is their concatenation, i.e. we'll get all elements from both lists back in a single list (for example, [1, 2] + [2, 3] gives [1, 2, 2, 3]).

We can apply this to our corpus by grouping by author then summing up our words. We’ll recalculate lexical density on an author by author basis, and see which authors favor particular words.
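A sketch of that rollup, keeping the frame and column names used above (the 'author' column is assumed to hold each fragment's author):

# Group fragments by author; summing the lists concatenates them into one list per author
words_df = (final_df.groupby('author')
                    .agg({'clean_words': lambda col: sum(col, []),
                          'lemmata': lambda col: sum(col, [])})
                    .reset_index())
words_df['word_count'] = words_df['clean_words'].apply(len)
words_df['lexical_density'] = words_df['lemmata'].apply(
    lambda lemmata: len(set(lemmata)) / len(lemmata) if lemmata else 0)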

We get a different picture of lexical density if we look at things at this level, instead of at the fragment level:

And again removing authors whose density is 1:

So we now have what looks a bit more like a normal distribution of lexical densities, which is — roughly — what we’d expect. Some authors are going to be more difficult, others a little simpler in their writing. Let’s take a look at the densest authors with over 100 words in the corpus:

words_df[(words_df['lexical_density'] >= .64) & (words_df['word_count'] > 100)]\
.sort_values('word_count', ascending = False)
#.64 is the median density

Conversely, if what we want to do is find someone with a sufficient sample size and a low lexical density (and answer the question posed at the beginning of this blog) we can execute the following:

words_df[words_df['word_count'] > 250]\
.sort_values('lexical_density', ascending = True)
# The median word count in our cleaned sample is 169.
# 250 gives us the 65th percentile

Looks like Joannes Antiochenus (John of Antioch, a 7th-century chronicler) is our guy, by this metric at least. Can we elaborate on this with a more sophisticated check of the difficulty, if not the density, of each author?

Ad pedem litterae

Let's try and evaluate each author's style. We'll want to get word counts, then see which authors use hapax legomena, words that occur only once in the corpus.

I tried two variations: one with Python's Counter class (from collections), and another with itemgetter (from operator). The itemgetter route returned tuples, which I ultimately ended up using to create the frequency distributions below. Either method should work.
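The Counter route might look like this (a sketch; the hapax column name is my addition for the check described above):

from collections import Counter

# Per-author lemma counts; itemgetter can then be used to sort the .items() tuples by frequency
words_df['word_counts'] = words_df['lemmata'].apply(Counter)
# Hapax legomena: lemmata that occur exactly once in an author's surviving fragments
words_df['hapax'] = words_df['word_counts'].apply(
    lambda counts: [word for word, n in counts.items() if n == 1])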

Let’s now take a closer look at the words our authors are using. We can repeat the process above, but on the corpus as a whole.

This list is basically what you’d expect to see. Greek 101 students learn all of these words in the first few weeks of class. εἰμί (eimi) is the verb “to be”, for example, οὐ (ou) is “not,” and so on.

Which author uses the rarest words, and which author uses the most common ones? I’m going to approach this question by building on our word counts and developing a word frequency counter. We can then calculate an author’s mean frequency of use — an ad hoc metric that should describe, along with lexical density, the lexical difficulty of an author’s work.

I’m going to normalize each word according to occurrences per 10,000 words using our word counts data frame. We have everything we need already: the count of each word, and the total number of words in the corpus.

There are 296,113 words in our corpus, so we need to divide each count by 29.6113 to get our frequency (again, counting frequency per 10,000 words). With frequencies in hand, we can create a weighted average frequency to gauge the lexical difficulty of an author. Lower frequencies will tend to indicate more difficult authors, and higher frequencies easier authors (I write all this with the caveat that almost nothing about Ancient Greek is "easy"). There is little (if any) intrinsic meaning to a weighted average frequency like the one we're building here; rather, it's better to think of the resulting number as an index of difficulty, and a first step toward evaluating the composition of a passage.
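A sketch of that normalization, building on the per-author counters above (the frame and column names are my assumptions):

import pandas as pd
from collections import Counter

# Corpus-wide lemma counts
corpus_counts = Counter()
for counts in words_df['word_counts']:
    corpus_counts.update(counts)

total_words = sum(corpus_counts.values())  # 296,113 in the cleaned corpus
freq_df = pd.DataFrame(list(corpus_counts.items()), columns=['word', 'count'])
# Occurrences per 10,000 words, i.e. count / 29.6113 for this corpus
freq_df['freq_per_10k'] = freq_df['count'] / (total_words / 10000)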

It might make a bit more sense to take the inverse of the frequency per 10k column as the weight for calculating our index. Doing so would do two things: first, it would reverse the direction of our frequency weights, so that higher numbers mean harder to read, which might make more intuitive sense; and second, it would assign a much higher weight to words that occur less often in our corpus, because 1/x grows without bound as x approaches 0. Note, however, that either way you run it, this is a slow function.

What I’m going to do is define a function to walk through the word count dictionaries we have. I’m taking advantage of the fact that we’ve already defined our frequency data frame with each word from the corpus.
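A sketch of that function, using the inverted frequency discussed above as the weight (the column and function names here are mine):

# Rarer words get much larger weights under the inversion
freq_df['inverse_freq'] = 1 / freq_df['freq_per_10k']
weights = freq_df.set_index('word')['inverse_freq']

def mean_frequency(word_counts, weights):
    # Multiply each lemma's count in an author's corpus by its corpus-wide weight,
    # then average over the author's total word count
    total = sum(word_counts.values())
    if total == 0:
        return 0.0
    weighted = sum(count * weights.get(word, 0) for word, count in word_counts.items())
    return weighted / total

words_df['lexical_difficulty'] = words_df['word_counts'].apply(
    lambda wc: mean_frequency(wc, weights))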

What this function does is multiply the number of occurrences of a given word in an author’s body of work by that word’s relative frequency in the DFHG corpus. So, for example if we look at Abas (a little-known historian mentioned only in the Byzantine encyclopedia, the Suda) in our words data frame, we have the following lemmata:

What I've done is take the occurrences of, for example, ἡρόδοτος (the proper name Herodotus, but lowercase) in Abas (here 4) and multiply that by the inverse frequency we calculated above. What does the resulting distribution look like?

And for comparison’s sake, here are the distributions of lexical difficulties, one with the inverted frequency (bigger numbers mean more rare words) and the other with regular frequency (smaller numbers mean more rare words):

Does this inform our decision on whom to read or teach? The "easier" authors have more common words, that is, words found across more works. It is "gateway" vocabulary in the sense that knowing those words helps with other authors, since they use more of the same words.

Difficulty tends to scale with the length of an author's corpus, but density does not. That is, writers gonna write (tenet insanabile multos scribendi cacoethes, after all), but rare words are rare, so even long texts can only be so "dense." Turns out maybe reading Priscus' prolix and feeble declamation isn't so tough.
