From historical source to historical data

Where I offer a taste of just one of the low-hanging fruits acquired over my past five months of Python: The Sabbatical.

Digital history is slowly catching on, but, thus far, my impression is that it’s still limited to those with deep pockets – big, multi-year research projects with a web gateway and lots of institutional support, including access to computer scientist collaborators. Since I’m not in that kind of position, I’ve set my sights a bit lower, focusing on the low-hanging fruit that’s available to historians just starting out with Python.

Yet much of this sweet, juicy, low-hanging fruit is, tantalizingly, still just out of reach. Undoubtedly you already know that one of the big impediments to digital history generally, and to historians playing with the Python programming language specifically, is the lack of historical sources in a structured digital format. We’ve got thousands of image PDFs, even OCRed ones, but it’s hard to extract meaningful information from them in any structured way. And if you want to clean that dirty OCR, or analyze the text in any kind of systematic way, you need it digitized, but in a structured format.

My most recent Python project has been to write some code that automates a task I’m sure many historians could use: parsing one big, long document of textual notes/documents into a bunch of small ones. It took one work day to create, without the assistance of my programming wife, so I know I’m making progress! Eventually I’ll clean the code up and put it on my GitHub account for all to use. But for now I’ll just explain the process and show the preliminary results. (For examples of how others have done this with Python, check out The Programming Historian, particularly this one.)

Parsing the Unparseable: Converting a semi-structured document into files

If you’re like me, you have lots of historical documents – most numerous are the thousands of letters, diary and journal entries from dozens of different authors. Each collection of documents is likely drawn from a specific publication or archival collection, which means they start out isolated in their own little silos. If you’re lucky, they’re already in some type of text format – MS Word or Excel, a text file, what have you. And that’s great if you just want to search for text strings, or maybe even use regular expressions. But if you want more – if, say, you want to compare person A’s letters with person B’s letters over the same timespan, or compare what they said about topic X, or what they said on date Z – then you need to figure out a way to make them more easily comparable, to quickly and easily find those few needles in the haystack.

The time-tested strategy for historians has been to physically split up all your documents into discrete components and keyword and organize those individual letters (or diary entries, or…). In the old days – which are still quite new for some historians – you’d use notecards. I’ve already documented my own research journey away from Word documents to digital tools (see Devonthink tag). I even created/modified a few AppleScripts to tackle this very problem in Devonthink in a rudimentary way: one, for example, can ‘explode’ (i.e. parse) a document by creating a new document for every paragraph in the original. Nice, but it can be better. Python to the rescue.

The problem: lots of text files of notes and transcriptions of letters, but not very granular, and therefore not easily compared, requiring lots of wading through dross, with the likelihood of getting distracted. This is particularly a problem if you’re searching for common terms or phrases that appear in lots of different letters. Wouldn’t it be nice if you could filter your search by date, or some other piece of metadata?

The solution: use Python code to parse the documents (say, individual letters, or entries for a specific day) into separate files, making it easy to home in on the precise subject or period you’re searching for, and to tag and keyword each piece precisely.

Step 1:

For proof of concept, I started with a transcription of a campaign journal kindly provided to me by Lawrence Smith, in a Word document. I’m sure you have dozens of similar files. He was faithful in his transcription, even to the extent of mimicking the layout of the information on the page with the use of tabs, spaces and returns.

Newhailes_sample1.png

Great for format fidelity, but not great for easily extracting important information, particularly if you want, for example, June to be right next to 20th, instead of on the line below, separated by a bunch of officers’ names. (‘Maastricht’ and ‘London’ are actually a bit confusing, because I’m pretty sure the place names after the dates are that day’s passwords – at least that’s what I’ve seen in other campaign journals. That some of the entries explicitly list a camp location reinforces my speculation.) Of course people can argue about which information is ‘important,’ which is yet another reason why it’s best if you can do this yourself.

Aside: As you are examining the layout of the document to be parsed, you should also have one eye towards the future. In this case, that means swearing to yourself: “I will never again take unstructured notes that will require lots of regex for parsing.” In other words, if you want to make your own notes usable by the computer and don’t already have a sophisticated database set up for data entry, use a consistent formatting scheme (across sources) that is easy to parse automatically. For example, judicious use of tabs and unique formatting:

Early_formatting_ideas.png

Step 2:

Clean up the text, specifically: make the structure more standardized so different bits of info can be easily identified and extracted. For this document, that means making sure each first line consists only of the date and camp location (when available), that each entry is separated by two carriage returns, and adding a distinctive delimiter (in this case, two colons, ‘::’) after each folio number – because you’ll ultimately have the top level of your structured data organized by folio, with multiple entries per folio (a one-to-many relationship, for those of you familiar with relational databases like Access). Cleaning the text can be easily done with regex, allowing you to cycle through and make the appropriate changes in minutes. Assuming you know your regular expressions, that is.

The result looks like this:

Newhailes_sample.png

Note that this stage is not changing the content, i.e. it’s not ‘preprocessing’ the text – standardizing spelling, expanding contractions, or what have you. Nor did I bother getting rid of extra spaces, etc. Those can be stripped with Python as needed.

For this specific document, note as well that some of the formatting for the officers of the day is muddled (the use of curly brackets seems odd), which might mean a loss of information. If that info’s important, you should take care to figure out how to record it robustly at the transcription stage. If you’re relying on the kindness of others, ‘beggars can’t be choosers.’ But, if you’re lucky, you might happen to have a scanned reproduction of a partial copy of the journal from another source, which tells you what information might be missing from the transcription:

Newhailes_sample_BL_Add61404.png

Camp journal sample of above, from British Library, Add MS 61404, f. 45.

You probably could do this standardizing within your Python code in Jupyter Notebook, but I find it easier to interact with regex in my text editor (BBEdit). Your mileage may vary.
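If you do want to keep everything in Python, though, a minimal sketch of that kind of regex cleanup might look like the following. The file names and patterns here are hypothetical – yours will depend entirely on the quirks of your own transcription:

```python
import re

# Read the raw transcription (assumed already exported to plain text).
with open('newhailes_raw.txt', encoding='utf-8') as f:
    raw = f.read()

# Hypothetical cleanup patterns:
# collapse three or more returns down to the two that separate entries...
raw = re.sub(r'\n{3,}', '\n\n', raw)
# ...and add the '::' delimiter after each folio number,
# e.g. 'f. 40' becomes 'f. 40::'.
raw = re.sub(r'(f\. \d+)\s*\n', r'\1::', raw)

with open('newhailes_clean.txt', 'w', encoding='utf-8') as f:
    f.write(raw)
```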

Step 3:

Once you get the text in a standard format like the above, you read it into Python and convert it into a structured data set. If you don’t know Python at all, the following details won’t make sense, so go read up on some Python! One of the big hurdles for the neophyte programmer, as I’ve discovered over and over, is seeing how the different pieces fit together into a whole, so that’s what I’ll focus on here. In a nutshell, after you’ve cleaned up the structure of the original document in your text editor, the code does the following:

  1. Read the file into memory as one big, long string.
  2. Perform any other cleaning of the content you want.
  3. Then you perform several passes to massage the string into a dictionary with a nested list for the values. There may be a better, more efficient way to do this in fewer lines, but my beginner code does it in the following steps (a consolidated sketch follows this list):
    1. Convert the document to a list by splitting the string at the ‘f. ’ delimiter. Now you have a list with each folio as a separate item.
      list_items.png
    2. Always look at your results. The first item of the resulting list is empty – that’s just how split behaves when the delimiter sits at the very start of the string, not an encoding error – so just delete that item from the list before moving on.
    3. Now, read the resulting list items into a Python dictionary, with the folio number as the dictionary key and all of the entries on that folio as the value. Use the ‘::’ as the delimiter here, with a single line of code – a ‘comprehension’, as they call it. Notice how the strip and split methods are chained together, performing multiple changes on each item in that single bit of code:
      dictionary.png
    4. Finally, use a for loop to parse each value into separate list items, splitting at the other delimiter, ‘\n\n’ (two returns), between entries (the value needs to be a string at this point, since the strip and split methods only work on strings). This gives you a dictionary with the folio as the dict key and a nested list as the value, each entry on that folio a separate item – as you can see with folio 40’s four entries:
      dictionary_nested_list.png
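Putting those steps together, the whole pass might look something like this sketch. It assumes the cleaned-up format described above – ‘f. ’ before each folio number, ‘::’ after it, and a blank line between entries – and the file name is just an example:

```python
# Step 1: read the cleaned file into memory as one big, long string.
with open('newhailes_clean.txt', encoding='utf-8') as f:
    text = f.read()

# Step 3.1: split at the 'f. ' delimiter -- one list item per folio.
folios = text.split('f. ')

# Step 3.2: a delimiter at the very start of the string leaves an
# empty first item, so drop it.
folios = folios[1:]

# Step 3.3: the comprehension -- folio number as key, that folio's
# text as value, chaining split and strip in a single line.
journal = {item.split('::', 1)[0].strip(): item.split('::', 1)[1].strip()
           for item in folios}

# Step 3.4: split each folio's text into separate entries at the
# blank lines, leaving a nested list as each value.
for folio in journal:
    journal[folio] = journal[folio].split('\n\n')

# journal['40'] is now a list of the four entries on folio 40.
```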

That’s pretty much it. Now you have a structure for your text. Congratulations, your text has become data, or data-ish at least. The resulting Python dictionary allows you to look up any folio and get back a list of all the letters/entries on that folio, and you can loop through those entries and perform some function on/with them. So it’s a good thing to ‘pickle’, i.e. write to a binary file, so that it can be easily read back as a Python dictionary later on.
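Pickling takes only a few lines (the file name, again, is just an example):

```python
import pickle

# Write the parsed dictionary to a binary file...
with open('newhailes.pickle', 'wb') as f:
    pickle.dump(journal, f)

# ...and read it back later, ready to use without re-parsing.
with open('newhailes.pickle', 'rb') as f:
    journal = pickle.load(f)
```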

Once you have your data structured, and maybe have added some more metadata to it, you can do all sorts of analysis with Python’s statistical, NLP, and visualization modules.
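As a toy example of the kind of thing the structure makes trivial, here’s a loop that counts how many entries on each folio mention a given term – the term itself is just an illustration:

```python
term = 'forage'  # hypothetical search term

for folio, entries in journal.items():
    # Count the entries on this folio containing the term, ignoring case.
    hits = sum(term.lower() in entry.lower() for entry in entries)
    if hits:
        print(f'f. {folio}: {hits} entries mention {term!r}')
```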

But if you are still straddling the Devonthink-Python divide, like I am, then you’ll also want to make these parsed bits available in Devonthink. Add a bit of code to loop through the dictionary and write out each entry to a separate file, and you end up with several hundred files:

Newhailes_finder_folder.png
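A minimal sketch of that write-out loop, building on the dictionary from the previous step (the folder and file-naming scheme are just one option):

```python
import os

out_dir = 'newhailes_entries'
os.makedirs(out_dir, exist_ok=True)

for folio, entries in journal.items():
    for i, entry in enumerate(entries, start=1):
        # e.g. 'f040_entry01.txt' -- assumes the folio keys are plain numbers.
        filename = f'f{int(folio):03d}_entry{i:02d}.txt'
        with open(os.path.join(out_dir, filename), 'w', encoding='utf-8') as f:
            f.write(entry)
```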

Each file will have only the content for that specific entry, making it easy to precisely target your search and keywording. The last thing you want to do is cycle through several dozen hits in a long document for that one hit you’re actually looking for.

Newhailes_sample_entry.png

That’s it. Entry of May 8th, 1705 in its own file.

The beauty is that you can add more to the code – try extracting the dates and camps (a sketch of one hypothetical approach is below), change what information you want to include in the filename, etc. Depending on the structure of the data you’re using, you might need to nest dictionaries or lists several layers deep, as discussed in my AHA example. But that’s the basics. Pretty easy, once you figure it out, that is.
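For instance, since each entry’s first line holds the date (and sometimes the camp), a first stab at extracting dates might look like this – the pattern is an assumption about how the dates are written, so check it against your own text:

```python
import re

# Hypothetical pattern: a month name followed by a day number,
# e.g. 'May 8th' -- adjust to match your transcription.
date_pattern = re.compile(
    r'(January|February|March|April|May|June|July|August|'
    r'September|October|November|December)\s+(\d{1,2})(?:st|nd|rd|th)?',
    re.IGNORECASE)

for folio, entries in journal.items():
    for entry in entries:
        if not entry.strip():
            continue
        first_line = entry.splitlines()[0]  # assumes the date sits on line one
        match = date_pattern.search(first_line)
        if match:
            month, day = match.groups()
            # Ready to use as metadata: in a filename, a tag, a CSV column...
            print(f'f. {folio}: {day} {month}')
```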

Even better: now you can run the same code, with a few minor tweaks, on all of those other collections of letters and campaign journals that you have, allowing you to combine Newhailes’ entries with Deane’s and Millner’s and Marlborough’s letters and… The world’s your oyster. But, like any oyster, it takes a little work opening that sucker. Not that I like oysters.
