The Pleasure of the Text

Reading Latin at Scale

Alex
9 min read · Oct 15, 2020

I have an old edition of Ancient Greek passages for sight translation practice that opens with an exhortation from the German theologian Albrecht Ritschl: “Lesen, viel lesen, sehr viel lesen, möglichst viel lesen”: Read. Read a lot. A lot a lot. If at all possible, read a lot.

I’ve written several articles on writing software to make ancient literature more accessible. Where before I focused on direct analysis of a corpus or a word, for this project I wanted to build something that would help me live up to Ritschl’s advice and read—a lot—at scale.

When I was studying for my language exams in graduate school, one of the most useful resources I found was Geoffrey Steadman’s website. Steadman produces a series of Greek and Latin texts with vocabulary and commentary on the opposite page — much like Clyde Pharr’s famous “purple Vergil” that has saved many a high school student much labor. These are excellent for scaling one’s reading ability. I think my Latin and Greek are both strong, but I admit that, even so, I often find myself reaching for a dictionary when reading anything more recherché than Ecce Romani.

In this project I walk through creating a Latin text with facing vocabulary at scale. I’ll be using Ammianus Marcellinus’ Res Gestae as my sample text. (Real historians might take this opportunity to make a comment about how we can read some of the travails of fourth-century Rome into our present moment; I’ll resist that temptation—at least until I’ve actually finished reading the text.)

I have a twofold goal here: maximizing readability and scalability. That means that I’m going to focus on widely available texts and on vocabulary. Of course one of the things that makes bilingual editions particularly useful is the inclusion of a commentary to help readers sort out some of the thornier grammatical and lexical issues one finds in reading any ancient text, but I don’t have the time (or the training) to write a useful commentary for anything (let alone everything) that I might want to read. Instead, I focus on providing a readable text (downloaded from Perseus or The Latin Library) and vocabulary on the same page. My program works with the python-docx library to create a Word doc as the final product. A user can then create a PDF or proceed with the Word file.
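
To give a sense of the python-docx side of things, here is a minimal sketch of how a single “page” of text plus vocabulary might be written out. The Latin snippet and the two glosses are placeholders standing in for the program’s actual data, not its real data structures:

```python
# A minimal sketch of the python-docx output step. The Latin snippet and the
# two glosses are placeholders standing in for a full page's worth of data.
from docx import Document
from docx.shared import Pt

doc = Document()

# Latin text for this "page" (the opening of Ammianus, book 14).
page_text = "Post emensos insuperabilis expeditionis eventus ..."
run = doc.add_paragraph().add_run(page_text)
run.font.size = Pt(12)

doc.add_paragraph("")   # gap between text and vocabulary

# Alphabetized vocabulary for the same page, one entry per line.
vocab = [
    "emetior, emetiri, emensus sum: to measure out, traverse",
    "insuperabilis, -e: insurmountable",
]
for entry in sorted(vocab):
    doc.add_paragraph(entry).runs[0].font.size = Pt(10)

doc.save("ammianus_facing_vocab.docx")
```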

The program I’ve written does several things to produce a .docx output:

  1. Chooses a text from Perseus or the Latin Library. Between them, there are some 2500 texts from 650 different authors, though there is some overlap in the catalogs. Perseus texts tend to be of better quality, while The Latin Library has a deeper catalog. At the same time though, many of the LL texts are junky, with typos that will complicate the lemmatization and vocabulary lookup that we want to do;
  2. Adds short definitions for each word in a passage using the Open Words software, along with grammatical information from Lewis and Short’s A Latin Dictionary. I also build a stop word list based on the selected corpus so the user doesn’t get overwhelmed with minutiae or common words (I don’t want to take up valuable page real estate with definitions of et or quam); there’s a small sketch of this filtering step just after this list. I use this in addition to a list of high-frequency Latin words I built from the Latin Library corpus;
  3. Creates a layout like this (taken from Steadman’s edition of Sallust’s Bellum Catilinae):

and creates a Word file.
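
To make step 2 concrete, here is roughly what the filtering looks like. The tiny stop list below is only a stand-in for the corpus-based lists described later on:

```python
# A stand-in for the stop-word filtering in step 2. The real stop lists are
# much larger and are built from corpus frequencies (see below); these few
# words are only here to make the example run.
STOPWORDS = {"et", "quam", "in", "non", "est", "ad", "cum", "ut", "sed"}

def needs_gloss(word):
    """Gloss a word only if it is not on the stop list."""
    return word.lower() not in STOPWORDS

words = ["et", "expeditionis", "quam", "insuperabilis"]
print([w for w in words if needs_gloss(w)])   # ['expeditionis', 'insuperabilis']
```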

Finally, I’d argue for the utility of this program not only for promoting facility with Latin, but as a worthwhile programming exercise in itself. Take the time to read the source code and you’ll find that to get this to work (which it does… most of the time) I’ve gotten several libraries talking to one another, parsed several formats of data, and provided users with some statistical tools as well. That’s not to say it’s a particularly elegant piece of programming, but I do believe it’s indicative of the broader range of possibilities in the digital humanities.

How I built this

I rely heavily on the Classical Languages Toolkit for this project, which has proven to be an invaluable resource (I previously used the CLTK in my post on the Digital Fragmenta Corporum Graecorum). The CLTK provides the texts, access to a lemmatizer, and a nice stop word creation tool that all go into this project. I also use the XML version of Lewis & Short’s A Latin Dictionary, available at the Perseus Digital Library’s GitHub page.
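
For anyone following along at home, the relevant pieces of the CLTK setup look roughly like this. This is the pre-1.0 API that was current when I wrote the program; module paths have since changed, so treat it as a sketch rather than copy-paste-ready code:

```python
# Pre-1.0 CLTK API (module paths are different in CLTK 1.x and later).
from cltk.corpus.utils.importer import CorpusImporter
from cltk.corpus.readers import get_corpus_reader
from cltk.stem.lemma import LemmaReplacer

# One-time downloads: the texts themselves plus the models the lemmatizer needs.
importer = CorpusImporter('latin')
importer.import_corpus('latin_text_latin_library')
importer.import_corpus('latin_text_perseus')
importer.import_corpus('latin_models_cltk')

# A corpus reader whose .paras() method supplies the paragraphs I paginate below.
reader = get_corpus_reader(language='latin', corpus_name='latin_text_latin_library')

# The lemmatizer maps inflected tokens back toward dictionary forms.
lemmatizer = LemmaReplacer('latin')
print(lemmatizer.lemmatize('arma virumque cano'))
```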

I also use Open Words, an open-source implementation of Whitaker’s Words, an online Latin-to-English dictionary hosted at Notre Dame. Open Words provides succinct definitions for a great many Latin words, which allows me to save space and fit both text and vocabulary on one page (most of the time). Open Words did not, however, totally fit my needs: in some cases it was unable to identify a word properly, and it does not return the full dictionary form of a word (e.g. if a user parses a word like “negavit” [he/she/it denied], it appears to be impossible to recover the dictionary form “nego” from Open Words).

I therefore combine definitions from Open Words with morphological information from the XML Lewis and Short. This results in an implementation that is possibly slower than it could be (it takes roughly 30 minutes to create a final product).
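
For the curious, the Lewis and Short side of that lookup goes something like this. I’m assuming the un-namespaced entryFree/key layout of the PerseusDL XML, the filename may differ depending on where you cloned it, and the cleanup of numbered keys (“nego1” to “nego”) is my own convention:

```python
# Sketch of building a headword lookup from the Lewis & Short XML published
# in the PerseusDL lexica repository. Assumes the file's <entryFree key="...">
# structure; the filename below may differ in your local checkout.
import re
from xml.etree import ElementTree as ET

tree = ET.parse("lat.ls.perseus-eng1.xml")

headwords = {}
for entry in tree.iter("entryFree"):
    key = entry.get("key", "")
    lemma = re.sub(r"\d+$", "", key).lower()          # "nego1" -> "nego"
    # <orth> holds the printed headword; keep the first entry we see per lemma.
    headwords.setdefault(lemma, entry.findtext("orth", default=lemma))

print(headwords.get("nego"))
```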

I used a script I had written for my earlier project on flashcards to parse the text and add definitions. It’s easy enough to store the definitions as a list, sort it alphabetically, and then store the sorted list as a string. The most challenging part of the project is getting the print layout right: I want to fill up a page with text and vocabulary with minimal wasted space. As far as I can tell, there is no way to view or manage “soft” page breaks with python-docx, so identifying where and how to break up the text (which in turn dictates how many vocabulary words need to be added to the bottom of the page) is the key to making the project worthwhile.

My solution is based more on rules of thumb than actual precision. I ran some tests to establish a) how many words take up half a page in Word (roughly 125) and b) based on the lists of stop and high-frequency words I created, how many words in each block of 125 I need to look up (certainly this depends on the passage, but we’ll say it’s roughly 8 per page). With these dimensions, I write a function to take in the paragraphs supplied by the CLTK’s reader.paras() method, which uses regex to identify section breaks in a text. This code takes in paragraphs, cleans the text (the reader.paras() method yields its output essentially word by word) and creates ‘pages’ of roughly 125 words using Python’s yield syntax to create a generator. I then perform my dictionary lookups and apply some Word formatting to create a page. When the pages are assembled, we have our complete document.
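
Stripped of the dictionary lookups and the Word formatting, the paging generator looks something like this (a simplified sketch; flatten() papers over the nested lists that reader.paras() can return):

```python
def flatten(item):
    """Recursively flatten the nested lists that reader.paras() can yield."""
    if isinstance(item, str):
        yield item
    else:
        for sub in item:
            yield from flatten(sub)

def pages(paras, page_size=125):
    """Group the corpus reader's output into 'pages' of roughly page_size words."""
    buffer = []
    for token in flatten(paras):
        token = token.strip()
        if token:
            buffer.append(token)
        if len(buffer) >= page_size:
            yield " ".join(buffer)
            buffer = []
    if buffer:                      # the final, short page
        yield " ".join(buffer)
```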

But wait — there’s a problem here. 125 words of prose is fine — with single spacing, a letter-sized Word page is ~300 words, so we have plenty of room for a line break and vocabulary. However, 125 lines of poetry is obviously not going to work (at single spacing and 12pt Calibri font, a Word document has 44 lines). So, the program has to have some way of reading text and preserving line breaks. We need to be able to parse each text to identify the line breaks, decide which texts are prose and which are poetry, and apply formatting appropriately.

Unfortunately, the CLTK’s texts from the Latin Library are in plaintext. This is good for some things, including rendering prose paragraphs, but won’t do for our purposes. The Perseus texts, which the CLTK stores as JSON files, work fine. Luckily, the CLTK mirrors the Latin Library website’s path structure in its local file system. Because we know where each file lives on the hard drive, we can find the corresponding page online and parse its HTML, which does preserve line breaks.
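
In sketch form, the round trip goes from local filename to URL to parsed lines. The .shtml extension and the br-based line breaks are assumptions that hold for many, though not all, Latin Library pages, and the example fileid is hypothetical:

```python
# Recover line breaks by fetching the Latin Library page that corresponds to a
# local CLTK file. The URL scheme and the <br>/<p> markup are assumptions that
# hold for many, but not all, pages on the site.
import requests
from bs4 import BeautifulSoup

def latin_library_lines(local_fileid):
    """e.g. 'ovid/ovid.amores1.txt' -> a list of verse lines from the website."""
    url = "https://www.thelatinlibrary.com/" + local_fileid.replace(".txt", ".shtml")
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    for br in soup.find_all("br"):              # <br> marks the end of a verse line
        br.replace_with("\n")
    text = "\n".join(p.get_text() for p in soup.find_all("p"))
    return [line.strip() for line in text.splitlines() if line.strip()]
```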

When I ran some tests initially, I found that the stop list provided by the CLTK (and also the stop list created with the StopListCreator class) was not sufficient for my needs, constrained as I am by both attention span and page real estate. I therefore created my own, larger stop list, based on the cumulative frequency of words across the Latin Library corpus. My goal here isn’t to analyze topics or sentiment but to be able to read the text, so I focus instead on making sure rare words are glossed and common words are passed.

I lemmatized the entire Latin Library corpus to determine both the number of unique tokens (actual, usually inflected words, e.g. “est” or “fuit” or “re”) and the number of unique lemmata (words in dictionary form, e.g. “sum” for “est” and “res” for “re”). I then determined the lemma corresponding to each token and how frequent each lemma’s corresponding tokens were in the overall corpus (I assume the Latin Library is broadly representative of all Latin literature — I may be incorrect to do so). For example, for the lemma “sum” (“to be” in Latin) I count the tokens “sum,” “es,” “est,” “sumus,” “estis,” “sunt,” etc. Here’s a sample of what I came up with:

Lemmas and token counts taken from the Latin Library corpus
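
The counting itself is simple once the lemmatizer is in hand; something like this, with a placeholder token list standing in for the full corpus:

```python
# Count how often each lemma's tokens occur. `corpus_tokens` is a placeholder
# for the full token stream of the Latin Library corpus.
from collections import Counter
from cltk.stem.lemma import LemmaReplacer

lemmatizer = LemmaReplacer('latin')
corpus_tokens = ["est", "fuit", "sunt", "re", "negavit"]

lemma_counts = Counter()
for token in corpus_tokens:
    lemma = lemmatizer.lemmatize(token.lower())[0]    # e.g. "est" -> "sum"
    lemma_counts[lemma] += 1

print(lemma_counts.most_common())
```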

For the facing vocabulary I eliminated the high-frequency lemmata whose tokens, taken together, make up 70% of all tokens in the LL corpus; in other words, a word only gets looked up and put on the page if it falls in the long tail of words that account for the remaining 30% of tokens. That should give someone like me, with a relatively strong background in Latin, enough help to keep going while keeping everything on one page.
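
Once the lemma counts exist, the cutoff itself is only a few lines; a sketch, reusing the Counter from the previous snippet:

```python
def high_frequency_stoplist(lemma_counts, coverage=0.70):
    """Collect the most frequent lemmata until they cover `coverage` of all tokens.

    Everything returned here is treated as a stop word; everything else gets glossed.
    """
    total = sum(lemma_counts.values())
    running, stoplist = 0, set()
    for lemma, count in lemma_counts.most_common():
        if running / total >= coverage:
            break
        stoplist.add(lemma)
        running += count
    return stoplist
```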

I should finally note that once everything is set up, the program is relatively fast: it took about 10 minutes to create the Ammianus document shown below.

Here’s the final result for Ammianus, book 14:

Is this thing on?

And some of Ovid’s elegiac couplets to show off poetry formatting:

Omnia mutantur, nihil interit

Refinements and Next Steps

Extending this to Greek texts would take some work, albeit much of it analogous to the steps here. The CLTK does have a library of Greek texts available, though I have not used it. The lack of a concise, accurate Greek dictionary makes fitting text and vocabulary onto one page difficult (this was the real advantage of using Open Words for this project). I will publish a short post here and update the readme on GitHub if (or when) I adapt this project to include Greek texts.

One further refinement would be using Lewis and Short to grab specific definitions based on the cited author. This will not be possible in all cases: Lewis and Short’s editors favored classical authors (Cicero, Caesar, Vergil — the usual suspects), so if we are reading something like Ammianus, we are less likely to find specific citations, and thus to link particular word senses to the context. (Note, however, that this is built into the online Perseus Latin Word Study Tool, which will mark matching entries with an asterisk if you click on a word in a source cited in the dictionary.)

I am not a linguistics or pedagogy expert, and cannot comment on the effect this kind of reading has on students trying to acquire the fundamentals of reading ancient languages. When I was a student, I put long hours into flashcards and kept my paper dictionaries close by at all times. But I have found that what I like most about reading Latin is, well, reading Latin, and breaking off to look a word up in the dictionary impedes both progress and the flow of the narrative. I hope that a program like this can help other students or enthusiasts discover more texts they like reading.

I believe that the key to better Latin is reading more Latin. (I also believe there are benefits to reading Latin that extend beyond merely being a better Latinist.) This project (in conjunction with my flashcard program) will, I hope, make it easier for readers at all levels to sehr viel lesen.
