Automating Anki with Python

Make Flashier Flashcards

Alex
Python in Plain English


I recently wrote a post about choosing an ancient text to read because, well, that’s something I do.

In that post I chose a corpus of fragmentary texts available at the wonderful DFHG project, and ran some NLP analysis to determine which would be easiest to read (my Ancient Greek vocabulary isn’t what it once was, and it never was all that great, to be honest).

Unfortunately, I didn’t take my own advice (nusquam tuta fides, “nowhere is trust safe,” as they say). I decided to read the seventh speech of Dio Chrysostom, a celebrated orator of the 1st century AD.

The speech is divided into two parts. The piece opens with a charming and substantial vignette about the life of subsistence farmers living on the island of Euboea in Greece. Reading this section of the speech is (relatively) painless — the vocabulary and syntax are appropriate for a plainspoken farmer, and the speech hews close to its theme, with few occasions for rhetorical bombast.

About halfway through the speech, though, Dio changes tack. The bombast that was held in check is let loose. The frame story is finished, and the orator proceeds to indulge himself with a lengthy and flowery discourse on the happiness of the poor in the ancient world. This section concludes with an extended vituperation on the dangers of prostitution. An excerpt gives a sense of the tone:

Neither barbarian women, I say, nor Greeks — of whom the latter were in former times almost free but now live in bondage utter and complete — shall they put in such shameful constraint, doing a much more evil and unclean business than breeders of horses and of asses carry on, not mating beasts with beasts where both are willing and feel no shame, but mating human beings that do feel shame and revulsion, with lecherous and dissolute men in an ineffectual and fruitless physical union that breeds destruction rather than life. (Trans. Thayer)

This section was painful to translate (almost as painful as it is to read in English), and there is plenty more where that came from in the speech. The second half of the speech was, to put it mildly, a slog.

If one wants to read Greek, there is unfortunately no avoiding purple passages like these. And while there’s no foolproof method to get through them, having a firm grasp of vocabulary certainly helps things move along. As a student, I often found myself reading something, clicking over to the Perseus Greek Word Study Tool, looking up a word, and ending up with a purple link — meaning I’d looked the word up before. I’d read about Mnemosyne (the goddess of memory and mother of the Muses) enough times, but evidently failed to offer the appropriate sacrifices.

The gold standard of flashcard apps is Anki, which uses spaced repetition to make learning more efficient. I’m not going to focus on the particulars of the spaced repetition system here. Instead, my focus will be on using Python to make Anki decks corresponding to a text. I want to be able to choose a text (Herodotus’ Histories, say), extract the necessary vocabulary from that text, and then push those vocab words to Anki.

There are five parts to this project:

  1. Generating a catalog of texts;
  2. Extracting vocabulary information from a chosen text;
  3. Filtering out common words and words that already exist in my Anki collection, to avoid spam and duplicates;
  4. Parsing the LSJ (the canonical Ancient Greek lexicon) to retrieve morphology, definitions, and example sentences for each vocab word;
  5. Creating an Anki deck corresponding to the vocab words extracted from each text.

I’ll go through each of those steps in more detail. If you’d like to see the code behind this post, check out the GitHub repo, here.

Part 1. Generating the catalog

There has been a tremendous amount of work done in the digital humanities over the last two decades, opening up a huge reservoir of texts and translations previously tucked away in university libraries. Two of the more monumental projects in the space are the Thesaurus Linguae Graecae (TLG), a collection of most texts written in Ancient Greek, and the Perseus Project from Tufts University, which also collects digitized Ancient Greek and Latin texts. There is a GitHub repo with lemmatized versions of the texts available here. That repo will be the source for the catalog we build. (It’s worth noting that because we’re using Perseus, the catalog won’t be totally complete. Perseus focuses on so-called “canonical” Greek lit, so if we wanted to do vocab for an author like Heliodorus, we’d be out of luck.)

The process to generate the catalog (and ensure we have access to the texts) is storage-intensive, but not code-intensive. (Presumably we could also use raw URLs to parse these files; I cloned the repo to my local machine to have everything in one place.) Thanks to Giuseppe Celano, we have lemmatized (i.e. each word is parsed and recorded in its “dictionary form”) versions of each of these texts available as large XML files. We need to get a unique list of texts and then parse each XML file to get the unique lemmata in each text. This code parses each file and generates a data frame that will serve as our catalog:
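
As a minimal sketch (the full version also parses each file to record author and title; here I just build the index of texts, keyed on the CTS URN reconstructed from each file name, and assume the repo has been cloned locally):

# build a catalog DataFrame from the cloned repo: one row per text, indexed by URN
import os
import pandas as pd

TEXT_DIR = './LemmatizedAncientGreekXML-master/texts/'

records = []
for fname in os.listdir(TEXT_DIR):
    if not fname.endswith('.xml'):
        continue
    stem = fname[:-len('.xml')]          # e.g. tlg0020.tlg001.perseus-grc2
    urn = 'urn:cts:greekLit:' + stem     # reconstruct the CTS URN
    records.append({'urn': urn, 'file': fname})

corpus = pd.DataFrame(records).set_index('urn')
print(len(corpus), 'texts in the catalog')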

So now we have our corpus, 910 texts in all. We can choose a text based on the author, title, or URN (think URL, but for stuff, not just webpages). Let’s pick one at random:

import xml.etree.ElementTree as ET

# sample text
text = corpus.sample(1, random_state=1234)
file = text.index[0]  # get urn for search

# read text
file = ''.join(file.rsplit(':')[3:])  # parse file path
path = './LemmatizedAncientGreekXML-master/texts/' + file + '.xml'
tree = ET.parse(path)  # element tree to parse XML
root = tree.getroot()

Part 2. Extracting words from a text

This part is a bit easier. We need to loop through the XML file and grab the lemmatized version of each word. We’ll append those to a list as we loop through the file. (It took some trial and error to find the level of the tree where the form we want is stored.)

In my case, the randomly selected text was Hesiod’s Theogony — a genealogy of the mythical Greek gods. This happens to work well for our purposes here. Hesiod being among the oldest Greek poets, the vocabulary is somewhat recondite, with many alternate spellings and rare words.
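
In sketch form (the attribute that actually holds the lemma depends on the schema of these files, so treat 'lemma' below as a placeholder to adjust after inspecting the XML):

# walk every element in the tree and collect its lemmatized form;
# the 'lemma' attribute name is an assumption about the markup
word_list = []
for elem in root.iter():
    lemma = elem.get('lemma')
    if lemma:
        word_list.append(lemma)

print(len(word_list), 'lemmata extracted')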

We now have a list, word_list in this code, that we can operate on.

Part 3. Cleaning the word list

Since our primary aim here is to produce a list of new vocabulary words, we should take care to avoid looking up and generating flashcards for common words (in other words, we don’t want to end up making cards for εἰμί or καί or ὁ every time we run the script). (This will also save us computing time — the script as currently written searches through each LSJ file to find definitions and citations. For long lists of words, this can take a while.)

I turned to Perseus’ frequency analysis tool to get a list of the most common words, which we’ll then use to filter our list of prospective vocab words. I set a coverage threshold of roughly 70%, meaning that we exclude the most frequent Greek words which, taken in order of descending frequency, together account for about 70% of the available corpus. To generate the list, I searched the corpus for all words with more than 100 total occurrences, and stored the results in a pandas DataFrame.

# create frequency list according to perseus
import pandas as pd
import requests

url = 'http://artflsrv02.uchicago.edu/cgi-bin/perseus/GreekFrequency.pl?author=&title=&genre=&displaymorethan=100&displaylessthan=100000&sortby=decreasingFreq&searchby=searchbylemma&language=NOT+English'
page = requests.get(url)
tbl = pd.read_html(page.text)  # generates a list of all html tables on the page
table = tbl[0]  # only one table on this page
table

# save the list for reuse below
table.to_csv('common_greek_words.csv')

Per the Perseus site, there are 5,293,231 word tokens (i.e. individual occurrences) in the searchable corpus, and among these, 417,033 distinct words, i.e. types. The complete LSJ that I have contains 116,501 entries (for reference, the English version of Wiktionary has approximately 520,000 word types, and the Oxford Latin Dictionary has 39,589 entries). That means that even if we discard the 3,525 words that make up our “most frequent” list, we still have plenty of words left to learn (and forget again). On the other hand, this approach might limit the utility of our flashcards. By filtering out so many common words, we run the risk of spending too much time memorizing hapax legomena (words that occur only once in the corpus).

We’ll also want to take care of punctuation, which we can strip from each word. To make these adjustments, we can take our word_list and filter it accordingly.

# read in data from above, select words to exclude
exclude = pd.read_csv('common_greek_words.csv')
exclude = list(exclude.word.unique())
word_list = [word for word in word_list if word not in exclude]

# strip punctuation
from string import punctuation
word_list = [word.strip(punctuation) for word in word_list]

Et voilà. We have our filtered word list. In my case (again, using Hesiod’s Theogony), I’m left with 1,175 new words (from an original total of 5,431 lemmata). So the filter did get rid of ~78% of the words. Now we can move to the next part — looking up these words automatically.

Part 4. Automating dictionary lookups

This is where the money is. Thanks again to Giuseppe Celano, whose work has been invaluable for this project, we have the entire LSJ as a series of XML documents that we can parse, much like we did with the corpus above. Yes, it is a lot of XML.

The LSJ is a monumental work of scholarship. While it reflects a certain scholarly bias toward Homer, Hesiod, and the canonical authors of the 4th and 5th centuries BC, it remains the single most useful resource for reading and understanding Ancient Greek. What makes the LSJ useful is the breadth of references the authors adduce for nearly every word, thus providing crucial context for words modern readers never hear, and only read in very particular literary contexts.

To make an effective flashcard, we want to include a few things — obviously we’ll start with the definition, the sine qua non for each word. We’ll also want to include example sentences to get an idea of the way each word was actually used.

To make this process work, we’ll take our word list as an input, loop through the files in the dictionary directory, and grab the attributes we care about. We append those to a Python dictionary to store values. We can save the output as a json file for storage, or continue to operate on it directly.

At this point, I haven’t optimized the search, and have gone with a brute-force approach, searching through each lexicon file for each word. This is by far the slowest part of the project. It takes Python about 16 seconds to parse the entire LSJ file tree. If, as is the case here, we have a list of ~1,000 words to look up, we’re looking at a wait time of some four hours, 23 minutes. Needless to say, it would save a lot of time to optimize this step — probably starting with a simple map based on the first letter of the word — but I’ll save that for rev 1.1.
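
As a rough sketch of that rev 1.1 idea, the word list could be bucketed by its accent-stripped first letter, so that each lexicon file only needs to be checked against one bucket rather than against every word (how the buckets map onto the individual LSJ files depends on how those files are split):

# bucket the vocab list by first letter, with diacritics stripped
import unicodedata
from collections import defaultdict

def first_letter(word):
    # decompose accents and breathings, return the first base character
    for ch in unicodedata.normalize('NFD', word):
        if not unicodedata.combining(ch):
            return ch.lower()
    return ''

words_by_letter = defaultdict(list)
for w in word_list:
    words_by_letter[first_letter(w)].append(w)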

Here’s the code to grab definitions and sentences for each word:
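
(In condensed sketch form; the element names entryFree, orth, tr, and quote reflect the Perseus TEI markup of the LSJ, and the LSJ_DIR path is a placeholder for wherever the lexicon files live.)

# brute force: scan every lexicon file for every word on the list; the element
# names used here assume the Perseus TEI markup of the LSJ
import os
import json
import xml.etree.ElementTree as ET

LSJ_DIR = './LSJ/'          # placeholder path to the lexicon XML files
targets = set(word_list)

lookups = {}
for fname in os.listdir(LSJ_DIR):
    if not fname.endswith('.xml'):
        continue
    lsj_root = ET.parse(os.path.join(LSJ_DIR, fname)).getroot()
    for entry in lsj_root.iter('entryFree'):
        headword = entry.get('key')
        if headword not in targets:
            continue
        orth = entry.find('orth')
        lookups[headword] = {
            'form': orth.text if orth is not None else headword,
            'senses': [tr.text for tr in entry.iter('tr') if tr.text],
            'citations': [q.text for q in entry.iter('quote') if q.text],
        }

# save the output as json for storage
with open('lookups.json', 'w', encoding='utf-8') as f:
    json.dump(lookups, f, ensure_ascii=False)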

But now we have a dictionary that has all the information we want to include in our flashcards: the full dictionary form of the word (nominative singular, genitive singular, and grammatical gender for nouns; nominative singular for each gender for adjectives; first person singular present active indicative for verbs), example sentences ('citations' in the code), and translations ('senses'). A sample of the resulting dictionary is here:

Dictionary of dictionary lookups

So we have the information we want stored in a dictionary or a json object. Next, we use this to build our flashcards.

Part 5. Finally, flashy flashcards

I’m going to make use of an external library for Anki called genanki. This library allows you to define a note style, create a deck, and add notes to the deck. You can then export the deck as a .apkg file, which Anki can read.

Anki is a remarkably extensible program, and there are tons of different options for customizing the look and behavior of your cards. I’m going to use a relatively simple model here. In this case, simple is probably better: following the minimum information principle, I want to display the dictionary form of the word on the front, and the definitions and example sentences on the back.

This code generates a simple model with three fields: front, back, and sentences. We also need to add some CSS styling so that the cards are readable: type at an appropriate size, layout centered, and so on.
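
A sketch of what that model could look like with genanki (the model ID below is just an arbitrary constant that needs to stay stable between runs, and the CSS is deliberately bare-bones):

import genanki

# note model: three fields, one card template, minimal CSS
greek_model = genanki.Model(
    1607392319,  # arbitrary but stable model id
    'Greek Vocabulary',
    fields=[
        {'name': 'Front'},
        {'name': 'Back'},
        {'name': 'Sentences'},
    ],
    templates=[{
        'name': 'Card 1',
        'qfmt': '{{Front}}',
        'afmt': '{{FrontSide}}<hr id="answer">{{Back}}<br><br>{{Sentences}}',
    }],
    css='.card { font-family: georgia; font-size: 24px; '
        'text-align: center; color: black; background-color: white; }',
)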

This is the code to complete our deck.
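
In sketch form, reusing the lookups dictionary and the model from above:

# build the deck, one note per looked-up word, and write it out as an .apkg file
greek_deck = genanki.Deck(2059400110, 'Greek Vocabulary: Hesiod, Theogony')  # arbitrary deck id

for word, info in lookups.items():
    note = genanki.Note(
        model=greek_model,
        fields=[
            info['form'],                    # front: dictionary form
            '; '.join(info['senses']),       # back: definitions
            '<br>'.join(info['citations']),  # sentences: example citations
        ],
    )
    greek_deck.add_note(note)

genanki.Package(greek_deck).write_to_file('greek_vocab.apkg')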

Now, we can import the deck into Anki and see how we did.

Is this thing on?

Not perfect, but pretty good. We’ve got a way to parse just about any text a casual reader would be interested in, generate a word list, filter out the words we (probably) already know, and create flashcards with the dictionary form and citations we need to understand each one. Certainly there’s more info we can include on the cards — dates, authorship for each sentence, etc. — but this program should be sufficient for casual readers’ needs.

Some final technical notes

I’ve written briefly about some of the shortcomings of this project as it stands. There are a few additional points I’d raise before ending this post.

First, it’s worth noting that this is intended to be a repeatable process, so I’ve wrapped the workflow up in a Python class. We only need one class, called TextParser in the code in the repo, which contains methods for each of the steps outlined above.
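
Roughly, the class has this shape (the method names here are illustrative rather than the repo’s actual API):

# illustrative skeleton only -- the real class in the repo may name things differently
class TextParser:
    def __init__(self, corpus_dir, lsj_dir):
        self.corpus_dir = corpus_dir
        self.lsj_dir = lsj_dir

    def build_catalog(self):           # Part 1: catalog of texts
        ...

    def extract_words(self, urn):      # Part 2: lemmata for one text
        ...

    def filter_words(self, words):     # Part 3: drop common words and known cards
        ...

    def lookup(self, words):           # Part 4: LSJ forms, senses, citations
        ...

    def build_deck(self, lookups):     # Part 5: write the .apkg file
        ...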

Second, the files used to build this project are large: the Perseus corpus, i.e. our catalog and word bank, is 3.72 GB. Not small. One potential extension of the project would be using urllib and xmltodict to parse the data online, without transferring everything to a local machine first. With this approach, optimizing the search algorithm becomes even more important (to send as few network requests as possible). Note that an alternative here is to simply copy and paste Greek text from some online source (Perseus, TLG, Wikisource, etc.), then parse the words with base Python, and feed them to the TextParser class. The LSJ files are (relatively) lightweight at 317 MB.
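
For example, something along these lines (the raw URL pattern and the file name are assumptions I haven’t tested):

# fetch a single lemmatized text straight from GitHub instead of cloning the repo;
# the URL and file name below are illustrative
from urllib.request import urlopen
import xmltodict

base = 'https://raw.githubusercontent.com/gcelano/LemmatizedAncientGreekXML/master/texts/'
with urlopen(base + 'tlg0020.tlg001.perseus-grc2.xml') as resp:
    doc = xmltodict.parse(resp.read())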

It is not lost on me that what this project does is, ultimately, give us more work to do. While it saves a ton of time on the front end, there’s no way to actually learn the vocab without hitting the books. Reading the classics is not easy; indeed, in the words of the philologist Ulrich von Wilamowitz-Moellendorff, “[t]o make the ancients speak, we must feed them with our own blood.” Absent a blood sacrifice to Mnemosyne, a votive of code (and disk space) will have to suffice.
