Given my impassioned proselytization of digital history, it’s not surprising that I received an email from a colleague asking a reasonable, yet hard-to-answer question: “what EXACTLY does Python do?” His name’s not important, but let’s call him “Bob.”
So if you’ve bothered reading any of my recent posts, and just wish I’d start from the beginning (or shut up, already), here’s my take on Python and computer programming, geared towards computer-knowledgeable people who haven’t programmed before. This is the perspective of someone who’s been a digitally-savvy humanities-type for thirty years, but has only recently dived into learning a computer programming language from scratch. Someone, keep in mind, who’s never gotten the official version from a computer science course, but who does teach a digital history course. So it’s therefore a focus on all the small and medium-sized things you can have your computer do for individual historical research – the digital historian’s low-hanging fruit.
What EXACTLY does a computer program do?
Python is a “high-level” general-purpose programming language. Which means, it uses a syntax that is ‘relatively’ readable, unlike the machine code used by the Matrix, and it can do ALL sorts of things with just about any kind of data you can imagine. But, unlike the Matrix, you can’t really bend or break the rules. At least I can’t.
So asking what ‘exactly’ a programming language like Python does doesn’t emphasize enough the fact that it does just about anything you could want, ASSUMING you have some digital information, AND assuming you want to do something with/to that data that can be translated into a series of discrete steps (an algorithm). In other words, there’s a lot of gold in them thar’ hills. But it’s only worth prospecting if you have (or can get) a lot of digital data, and need to repeat the same kinds of manipulations and analysis over and over: for each character/word/paragraph/document, I want to do X, Y, Z; if it says ‘blah’ I want to… (I just introduced you to for- and if-loops, in case you didn’t notice.) As long as the above two assumptions are met (nice to meet you, Boolean value = True), you’re golden.
Imagine that you were asked “what exactly does Microsoft Excel do?” Instrumentally, you’d probably reply that it allows you to type in numbers, edit the numbers with features like copy and paste, format the numbers in a variety of ways, perform mathematical calculations on the numbers, and print out or email the results. Or, you could answer the question by giving specific examples of projects you could do: Excel allows me to balance my budget; it lets me keep track of student grades, or my fantasy league stats. Good answers all, but just the tip of the iceberg. You could create an entire Excel spreadsheet that doesn’t have a single number in it. You could use Excel to lay out a particularly complicated table that Word’s Table feature chokes on, or that Word generally makes a pain in the arse to edit. You could, like I did twenty years ago, use Excel as a go-between when transferring Word notes into a note-taking Access database. (See! that Python post does have some utility after all.) You could even use Excel to rename a whole folder full of files based off another list, e.g. rename all those EEBO-TCP files you downloaded, which have names like A10352.txt. In a sense, that literally has almost nothing to do with spreadsheets – you delete the Excel file when you’re done, because it was just a means to an end. In other words, what “exactly” Excel does is limited most broadly by the features built into the application, but, more practically, it depends on the vision and technical expertise of the person using it.
Same with a computer programming language. But the Python canvas starts out with nothing at all on it, not even a grid of empty cells. Intimidating, but its tabula rasa lets Python deal with all sorts of different types of information, and allows you to manipulate them in an almost infinite number of ways, because you can keep building layer upon layer of functions on top of each other, drawing in data and inputs from all sorts of places, and sending them wherever you want. So, if you can think of a specific series of commands you invoke using some packaged software like Word or Excel or your web browser, you can (probably) recreate that workflow in Python. But it’s not just about replication; I see three main advantages to coding over a packaged piece of software:
- Coding allows you to totally control the inputs, the manipulations, and the outputs. Don’t like the way a program formats the results? Change it. Wish the summary feature would also include your favorite statistic? Include it. Hate how the program always does this one annoying thing? Fix it.
- Performing the task will be automated with Python, so you could run it thousands of times, in a few seconds (or more, depending on the complexity of the task), without getting carpal tunnel syndrome. Some packaged programs allow you to install or use macros to ‘record’ these repeated actions, and programming is like that, but so much more, because you have so much more control, and a much larger toolbox to draw upon. The result is a quantitative increase in speed, but that quantitative increase is really a qualitative advance – you can do things that you’d never do otherwise, because the computer does most of it for you.
- You don’t need to purchase, learn and maintain the dozen different programs that would be needed to to do (most of) the data analysis you can perform with Python. Nor do you need to worry about your specialized programs disappearing into history, as happens shockingly often unless your name is Microsoft or Apple or Adobe. Nor do you need to worry about radical changes from version to version. If you’ve experienced several generations of Microsoft Office or Blackboard or, heaven help you, mobile apps, you know what I mean. Python will definitely require you to keep an eye on changing Python library versions, and possible incompatibilities that might result. But at least with Python, you can create virtual environments so that you can run different versions of the same libraries on the same machine. And there are even web environments that allow you to set all of that up for the user.
Python, in other words, allows you to create workflows that automate a lot of the processing and analysis steps you’d have to do manually, possibly in many different programs. Input data, manipulate it in any number of ways, and then output it.
But it definitely comes with a learning curve, I won’t lie. A learning curve that is currently a lot harder for humanists, in my humble opinion, because we are not the primary users, and therefore we don’t have many pre-made, history-specific libraries with functions already designed. As a result, I’ve spent most of my time learning the fundamentals; the eye candy of maps and charts is much better served by dedicated libraries that require less programming knowledge and more domain expertise. My experience over the past several months has reinforced three important programming principles that I’m sure Programming 101 courses emphasize:
- Think through the nature of your data at each point in your code.
- Think through each logical step in your algorithm.
- Pay constant attention to the numerous syntax rules.
Ignoring any of the above will lead to runtime errors. Fortunately, you can google those error messages and probably figure out the problem, since it will likely be a violation of one of the above principles. Are you trying to use string methods on a list item? Are you expecting your iterator to be an iterable? Do you really want to put the print function inside that loop? Are you blindly copying an online example without considering how your case is different? And, as I keep telling my students, attention to details matter, whether it’s a missing close quote or a mismatched variable name. But the computer will usually tell you if you make a mistake, you learn these things over time (and build up working code that you can copy or use as functions), and practice makes perfect.
What EXACTLY does Python do for Scholars?
For scholars, a programming language like Python can help manipulate and analyze whatever text, numbers, image, video and sound information we have in digital form. Fortunately, it’s not as minimalist as all that, thanks to a few thousand free libraries (or modules) that you import into the basic Python package. The libraries do most of the hard work behind the scenes, i.e. most of the coding has already been done by the libraries’ authors, so you just plug and play with some standard commands, your specific data, the specific variables you create, and the specific commands combined in the specific order you choose. Figure out how to get your information into the code (i.e. the computer’s) memory in a specific format based on its structure (is it a string? a list? a dictionary?…), manipulate the resulting data (maybe you convert the string to a list, after replacing certain features and adding others), pass it on to the next block of code that does something else (now that it’s a list, you can loop through each item and count those that start with the letter ‘q’), do more things to the list and the count (maybe if the count exceeds a certain threshold, you send that word to another list), then pass any of those results on to the next block of code, until you end up with what you want. To give a simple real-world example: maybe you’ve started with a long string of text (like this paragraph here), you tokenize it into a list of individual words (deciding how you want to deal with punctuation and contractions), then count the words according to the letter they start with, then plot a histogram showing the frequency of each letter.
Python is also attractive to scholars because it’s free – insert joke about poor professor here. Its costlessness and open source ethos have encouraged hundreds of people to create free general libraries focusing on particular types of data and particular types of analysis, along with specialized domain libraries for astronomers, for geographers, for audiologists, for linguists, for stock market analysts… There is also a massive number of tutorials available online. Every year there are a dozen Py conferences held all over the world, and several hundred of the presentations are available on YouTube, including numerous 3-hour tutorials for beginners. You can also check out the Programming Historian website, which has numerous examples in Python. There are numerous cautions with programming (in our case, use Python 3+, not 2.7…), but there are lots of resources that discuss those. Plenty of ways to get started, in other words.
A final benefit of particular importance for humanities-types is Python’s ability to convert words into numbers, usually behind the scenes, and highlight patterns using various statistical properties of text. Such powerful text functions allow businesses to data mine tweets and online content; business demand seems to have juiced computer science research, leading to lots of advanced natural language processing (NLP) features, on top of those driven by the (older) linguistic and literary interests of academics. So, if you have a lot of digitized text or images and you want to clean/analyze them beyond just reading each document one by one, or manually cycling through search results, one… hit… at… a… time…, then Python is worth a look.
What EXACTLY does Python do for this Historian?
So here’s a list of the python projects I’ve been working on, and those I will be working on in the future. A few are completed, a few have draft code, a few have some ideas sketched out with snippets of code, and a couple are still in the fantasy phase. Many use the standard functions of off-the-shelf libraries, while others require a bit more custom coding. But they all should be viable projects – time will tell.
- Semi-automate a book index: Find all (okay, maybe most) of the proper nouns in a PDF document, along with which PDF page each occurred on, then combine them together into a back-of-the-book index format. If you don’t want to pay $1000 to have your book professionally indexed, you could use Word’s or Adobe’s indexing feature, which requires you to go through every sentence and identify which terms will need to be indexed. Or, you can get 85% of that with Python’s NLP (natural language processing) libraries, or you can import in a list of people/places/events and it will find those for you. As with all programs, things will get complicated the more edge and corner cases you try to address: Do I need to include a “he” on the next page with the full name on the previous page? Do I just combine together all consecutive pages into 34-39, or do I need to judge the importance of the headword to each page’s discussion? Tough questions, but this code will, at the least, give you a basis from which to tweak. And, judging from recent indexes in books published by highly-reputable academic presses, nobody cares about the ‘art’ of indexing anymore: some are literally headwords with a giant undifferentiated list of several dozen page numbers separated only by commas; some don’t even provide any kinds of topics, only proper nouns. Of course, the index may well die as more works are consumed digitally…
- Web scraping: Automate downloading content from a website, either the text on pages, entries in a list, images of battle paintings from Wikipedia, or linked files… Maybe automate the SPARQL query on historical battles I posted about awhile back. Or download a bunch of letters from a site that puts each letter on its own separate page. Maybe automate scraping publication abstracts from a website based off records in Zotero. (with a library like beautifulsoup)
- Web form entry: I’d like to create code that would automate copying bib info from Zotero (author, title, date, pages, etc.) and then paste it into our library’s online ILL form in the respective webform fields, which, of course, aren’t in the same order as the Zotero field order. That means a bunch of cutting and pasting for every request.
- Look up associated information on an entity (person, place, organization) with Linked Open Data, e.g. find the date of birth for person X mentioned in your text document via the web. (rdflib)
- [Added by request – get it?] APIs: Numerous institutional websites allow you to access their online data more directly through API, rather than using brute force scraping to harvest information from their pages’ HTML. Those websites have APIs (Application Program Interfaces) to allow more sophisticated downloading of information. You can search their site for ‘API’ to see if they offer it. (requests)
- Convert information into data: My two previous posts on AHA department enrollments and parsing long notes illustrate how this can be done with Python. Code that converts information into data is particularly important low-hanging fruit for historians, since we lack a lot of already-digitized datasets – this kind of code allows us to create them with our own data.
- Clean dirty OCR text: Correct OCR errors, and generally make a document more readable by humans and computers. A good, detailed description is Ryan Cordell’s Q i-jtb the Raven: Taking Dirty OCR Seriously. This requires a lot of hands-on work with the code, which I’ve been doing of late. E.g. find every occurrence of ‘Out-work’ and convert it to ‘outwork’ – so we can count them all the same way. Find every misOCRed ‘Mariborough’ and convert it to ‘Marlborough’ – there are big lists of common errors available to make this search-and-edit process a bit more precise. But since you’ll never guess every word that might be hyphenated due to line endings, finding every hyphenated word (like ‘be- siege’) and converting it back (to ‘besiege’) is easy enough, with regular expressions. You can even create a list of all the changes your code makes (i.e. create a dictionary with each ‘mistake’ and what it was changed to), if you want to audit the process. More difficult are the Questions of Capitalizations (case), especially when our 17-18C Authors liked to capitalize lots of Nouns, yet modern NLP uses Capitalization as one of its Clues for identifying Proper Nouns (Named Entity Recognition). Ideally you’d have this code as a series of functions so that you can run the various corrections across an entire folder of documents, based on what that source needs. Then, you could use more code to check for any other errors or documents that require special handling.
I’d argue that this is currently the most critical area for digital history, the bottleneck, in fact. So few historians have their own sources in clean full text, yet it’s also a very idiosyncratic thing to program, based on historically-variant word usage and widely-varying source genre vocabulary, as well as the sometimes-random errors derived from OCRing irregularly set type from a few hundred years ago. (It’d be great if OCR accuracy rates were 100%, but that ideal would seem to require having high-quality scans of the originals, which your average scholar does not have, and will likely never acquire, because we don’t actually own the originals). As a result, cleaning historical OCRed text is the area with the least amount of pre-made code available – big projects like EEBO and ECCO paid cheap foreign labor to type theirs by hand. We should also note that lots of social scientists, for example, talk about ‘preprocessing’ text, but by that they mean standardizing the spelling of text that’s been born digital, making everything lowercase, stripping out punctuation, etc. Historians need a lot of pre-preprocessing first because we are dealing with imperfectly OCRed text. And if we want to retain a cleaned text copy of the original, and not just atomize the text string into a list of word tokens, then it’s even more complicated. Suggestions welcome! Ted Underwood has provided some useful ideas in various venues dealing with big data (10,000s of texts), but his cleaning code on GitHub is a bit above my skill level.
- Quantitative analysis: Load your spreadsheet into the pandas library and run stats, plot charts… (pandas, matplotlib, seaborn, bokeh for interactivity)
- Make interactive visualizations, create interactive websites with your data, and so on. (bokeh…)
- Create a visual timeline, drawn from data extracted from a document. Possibly interactive. Haven’t explored this yet, but I will.
- Mapping: Quickly make maps, including small multiples, in whatever projection, with whichever features you want to display. Look up coordinates (aka geocode) and calculate spatial/topological relationships… It’s good to see that there are several international project teams working on historical gazetteers, and a number of groups have georeferenced some early modern maps as well. (geopandas, cartopy…)
- Network analysis: create a network ‘graph’ (diagram) of entities and relationships between those entities (nodes and edges), and measure the topological properties of the network, who are the hubs, the spokes, the nodes… (networkx)
- Relational database interactions with SQLite and MySQL. (sqlite3) I believe you can do the same with graph databases (triplets of subject-verb-object), but I haven’t really looked at those.
- Zotero: Update and manipulate your Zotero records. I can already read Zotero data into Python and hand it off to other libraries for analysis, as well as go in the other direction, e.g. mass update fields back in Zotero. But I’d like to create code that will take a PDF page (or two) of a book’s table of contents and enter a separate record for each chapter in the book into Zotero, automatically adding in all the other book info. (pyzotero)
- Dates: Automatically calculate duration between two dates, convert between OS and NS and other historical calendars, look up “last Tuesday’s” date when mentioned in a letter written on July 7, 1700 – you could probably automatically insert that date into the source document if desired. Maybe even do a quick calculation to see how few days you have left on your sabbatical… (calendar, datetime, dateutil, convertdate, dateparser, arrow)
- Textual analysis: This is a biggie for historians. Create a corpus of texts in your area; create a list of people, places and events to use for extraction; cluster works (or segments) together by the topic they discuss; see how often different authors/texts use particular words/phrases; identify which words tend to be collocated with which other words; keywords-in-context; sentiment analysis; etc. Did I mention fuzzy searching, finding words that are spelled similarly? Or finding words that are used in the same context as a given word? Maybe you want to analyze your own prose: which words/phrasings/grammatical structures do you overuse? (NLTK, spaCy, textacy, gensim, word embeddings…)
- Bibliometrics and Historiographical Analysis: From a secondary source, extract all the people, publications, places and time periods/dates mentioned, and graph/map them, before comparing them with other authors. Or analyze the sources cited in the bibliography – authors and affiliations, years of publication, languages, etc. The sciences have a lot of this already because they mostly publish journals and they’re in databases like Web of Science. This also ties into network analysis, especially if you want to look at citation networks.
- Analyze words/phrases from the 16m-book HathiTrust collection. There’s a website for that, but you can also download the data, or subsets at least.
- Genealogy: parse genealogical data and analyze. Would be interesting for royal lineages, and some work has already been done on that.
- Sound analysis. Haven’t played with these, but some people are into reconstructing soundscapes and the like.
- Image classification and analysis: Group together all the portraits of person X, etc. Haven’t played with these, though you have similar classification algorithms in Facebook, etc.
- Lots of full-fledged programs are also python scriptable. E.g. both ArcGIS and QGIS have python interfaces, which means you can automate many of the boring tasks you need to perform when making more sophisticated maps.
- Clean up your computer files – batch rename, copy, delete, convert, etc., with much more flexibility than Mac OS X’s Rename function.
- Automate lots of administrative school work. Create a syllabus class schedule that lists the day of week and date for each meeting during the semester, removing any holidays or other days off for that specific semester. I’ll be department chair next year, and there are lots of stats and reports on enrollment, assessment… that I’d like to automate: collecting the data from databases/surveys and then analyze them, without me having to manually repeat the entire process every time. A computer science colleague will have a student working on a course scheduler next semester – given a department’s faculty requests, the available timeslots, and a few dozen university and departmental scheduling requirements, come up with a schedule that meets all those criteria, or at least the most important ones. Now, we have to do this by hand (with Excel, but still), and it’s a real pain.
- Machine learning/AI: Python is also one of the main languages used for this new burgeoning field. For historians, that might mean classifying documents and topics, but I haven’t looked into it enough to think about how it could be used. Some of the above-mentioned libraries might well be superseded by machine learning libraries in the future, where things like neural nets figure out their own algorithms without rules being specified by the programmer. I think we’re already seeing a little bit of that with NLP.
And those are just a few of the things you can do with Python! So whatever data-related project you can think of, there’s probably a way to do it in Python. It’s not just automating the things that you find yourself doing on the computer over and over and over and over again. Just as important, what are the research questions that you want to ask, especially those that would require a lot of drudgery like counting and sorting and revising thousands of documents? Any software package that will answer that particular question for you will have its own learning curve, and there probably aren’t many people whom you could hire to do it for you, so you will probably be on your own. Whatever your question, there’s likely a way to combine the various Python tools together in a way that gets you the desired output.
But that’s not all. Using code also means:
- You can take whatever output and turn it into the input for another bit of code, and so on and so on. It is practically infinitely extensible.
- You can rerun your code but change a parameter, to see the difference it makes. ‘What-if’ exploration is super simple, and you can easily change a parameter anywhere in the workflow and continue the rest of your code with the new results.
- When you’re all done with your code, you can run it on another data set or text, or a whole folder full. And then you can compare the results.
- When you notice an intriguing pattern in one of your sources, you can quickly add another bit of code to explore it. Then you can look for that pattern in your other documents.
- You will also have a record of your process and method: which data you used for which analysis, how you cleaned the data, which settings and parameters you used, the order in which you performed your various steps, and so on. I’m guessing that more than a few historians would be unable to repeat, much less explain, how exactly they got the results they did. How faithfully, for example, do we record our computer-based research workflow? Some grant agencies are beginning to require recipients submit their data and workflow, along with their results. “Replicability” could even come to mean something in History.
This historian’s “killer app” for Python is a program that reads in a (primary or secondary) source from a text file, and then the code provides statistics on the words and phrases used, identifies rare terms that are unusually common in that document (compared to some corpus), extracts all the proper nouns mentioned, provides a statistical overview of their frequency (overall, and by section of book…), looks up information on the people (say, their nationality, age, etc.), then looks up the coordinates of mentioned places and maps them according to some criteria (by person who mentions the place, by where in the text it is mentioned, by what other things are mentioned around that place…). One output of all this could be tables or graphs of the entities in the text, word visualizations and the like. Another output could be automatically-created maps – not just maps of any of the above entities, but small-multiple maps that would locate a variable (say, siege duration) across four different theaters, and then another set of small-multiple maps that would similarly map the same variable by year instead. Might as well have it make a heat map while you’re at it. Several groups have already created web versions of some of these features (voyant-tools among them). But with your own code, you also end up with all these results in the code itself, which can be further analyzed with yet more code. Then, your code runs itself on a bunch of other documents, and includes comparisons between documents – which texts talk more about place X? This works for teaching as well as research. Imagine if you had a class where you assigned a source, had the students analyze it, and then put an interactive visualization of the document up on the screen to explore. This really wouldn’t be that hard – I already have almost all of the bits, and it’s just a question of chaining them all together. It will take a while to make sure the objects, logic and syntax are all copacetic, but hopefully it’ll be done in time for classes next fall.
If you’re not sure about diving into Python, I’d suggest you start by getting as many of your sources in digital form as possible. Scan, OCR, type. Then get yourself a decent text editor like Notepad++ or Text Wrangler/BBEdit and start learning regular expressions.
But the more historians we get writing Python code, the more history-specific code we can build off of. So let’s get started.
Where I offer a taste of just one of the low-hanging fruits acquired over my past five months of Python: The Sabbatical.
Digital history is slowly catching on, but, thus far, my impression is that it’s still limited to those with deep pockets – big, multi-year research projects with a web gateway and lots of institutional support, including access to computer scientist collaborators. Since I’m not in that kind of position, I’ve set my sights a bit lower, focusing on the low-hanging fruit that’s available to historians just starting out with python.
Yet much of this sweet, juicy, low-hanging fruit is, tantalizingly, still just out of reach. Undoubtedly you already know that one of the big impediments to digital history generally, and to historians playing with the Python programming language specifically, is the lack of historical sources in a structured digital format. We’ve got thousands of image PDFs, even OCRed ones, but it’s hard to extract meaningful information from them in any structured way. And if you want to clean that dirty OCR, or analyze the text in any kind of systematic way, you need it digitized, but in a structured format.
My most recent python project has been to create some python code that automates a task I’m sure many historians could use: parsing a big long document of textual notes/documents into a bunch of small ones. It took one work day to create it, without the assistance of my programming wife, so I know I’m making progress! Eventually I’ll clean the code up and put it on my GitHub account for all to use. But for now I’ll just explain the process and show the preliminary results. (For examples of how others have done this with Python, check out The Programming Historian, particularly this one.)
Parsing the Unparseable: Converting a semi-structured document into files
If you’re like me, you have lots of historical documents – most numerous are the thousands of letters, diary and journal entries from dozens of different authors. Each collection of documents is likely drawn from a specific publication or archival collection, which means they begin being all isolated in their little silos. If you’re lucky, they’re already in some type of text format – MS Word or Excel, a text file, what have you. And that’s great if you want just to search for text strings, or maybe even use regular expressions. But if you want more, if, say, you want to compare person A’s letters with person B’s letters over the same timespan, or compare what they said about topic X, or what they said on date Z, then you need to figure out a way to make them more easily compared, to quickly and easily find those few needles in the haystack.
The time-tested strategy for historians has been to physically split up all your documents into discrete components and keyword and organize those individual letters (or diary entries, or…). In the old days – which are still quite new for some historians – you’d use notecards. I’ve already documented my own research journey away from Word documents to digital tools (see Devonthink tag). I even created/modified a few Applescripts to automate this very problem in Devonthink in a rudimentary way: one, for example, can ‘explode’ (i.e. parse) a document by creating a new document for every paragraph in the starting document. Nice, but it can be better. Python to the rescue.
The problem: lots of text files of notes and transcriptions of letters, but not very granular, and therefore not easily compared, requiring lots of wading through dross, with the likelihood of getting distracted. This is particularly a problem is you’re searching for common terms or phrases that appear in lots of different letters. Wouldn’t it be nice if you could filter your search by date, or some other piece of metadata?
The solution: use Python code to parse the documents (say, individual letters, or entries for a specific day) into separate files, making it easy to hone in on the precise subject or period you’re searching for, as well as precise tagging and keywording.
For proof of concept, I started with a transcription of a campaign journal kindly provided me by Lawrence Smith, in a Word document. I’m sure you have dozens of similar files. He was faithful in his transcription, even to the extent of mimicking the layout of the information on the page with the use of tabs, spaces and returns. Great for format fidelity, but not great for easily extracting important information, particularly if you want, for example, June to be right next to 20th, instead of on the line below, separated by a bunch of officers’ names. (‘Maastricht’ and ‘London’ are actually a bit confusing, because I’m pretty sure the place names after the dates are that day’s passwords, at least that’s what I’ve seen in other campaign journals. That some of the entries explicitly list a camp location reinforces my speculation.) Of course people can argue about which information is ‘important,’ which is yet another reason why it’s best if you can do this yourself.
Aside: As you are examining the layout of the document to be parsed, you should also have one eye towards the future. In this case, that means swearing to yourself that: “I will never again take unstructured notes that will require lots of regex for parsing.” In other words, if you want to make your own notes usable by the computer and don’t already have a sophisticated database set up for data entry, use a consistent format scheme (across sources) that is easy to parse automatically. For example, judicious use of tabs and unique formatting:
Clean up the text, specifically: make the structure more standardized so different bits of info can be easily identified and extracted. For this document, that means making sure each first line only consists of the date and camp location (when available), that each entry is separated by two carriage returns, and adding a distinctive delimiter (in this case, two colons, ‘::’) between each folio – because you’ll ultimately have the top level of your structured data organized by folio, with entries multiple entries per folio (this is a one-to-many relationship, for those of you familiar with relational databases like Access). Cleaning the text can be easily done with regex, allowing you to cycle through and make the appropriate changes in minutes. Assuming you know your regular expressions, that is.
The result looks like this:
Note that this stage is not changing the content, i.e. it’s not ‘preprocessing’ the text, doing things like standardizing spelling, or expanding contractions, or what have you. Nor did I bother getting rid of extra spaces, etc. Those can be stripped with python as needed.
For this specific document, note as well that some of the formatting for the officers of the day is muddled (the use of curly brackets seems odd), which might equal loss of information. But if that info’s important, you should take care to figure out how to robustly record it at the transcription stage. If you’re relying on the kindness of others, ‘beggars can’t be choosers.’ But, if you’re lucky, you happen to have a scanned reproduction of a partial copy of this journal from another source, which tells you what information might be missing from the transcription:
You probably could do this standardizing within your Python code in Jupyter Notebook, but I find it easier to interact with regex in my text editor (BBEdit). Your mileage may vary.
Once you get the text in a standard format like the above, you read it into python and convert it into a structured data set. If you don’t know Python at all, the following details won’t make sense. So go read up on some Python! One of the big hurdles for the neophyte programmer, as I’ve discovered over and over, is to see how the different pieces fit together into a whole, so that’s what I’ll focus on here. In a nutshell, the code does the following, after you’ve cleaned up the structure of the original document in your text editor:
- Read the file into memory as one big, long string.
- Perform any other cleaning of the content you want.
- Then you perform several passes to massage the string into a dictionary with a nested list for the values. There may be a better, more efficient way to do this in fewer lines, but my beginner code does it in three main steps:
- Convert the document to a list, splitting each item at the ‘f. ‘ delimiter. Now you have a list with each folio as a separate item.
- Always look at your results. For some reason, the first item of the resulting list is empty (it doesn’t seem to be an encoding error), so just delete that item from the list before moving on.
- Now, read the resulting list items into a python dictionary, with the dictionary key the folio number, and all of the entries on the folio as the value of that folio. Use the ‘::’ as the delimiter here, with the following line of code, a ‘comprehension’, as they call it. Notice how the strip and split methods are chained together, performing multiple changes on the item object in that single bit of code:
- Now you use a for loop to parse each value into separate list items, using the other delimiter of ‘\n\n’ (two returns) between entries, using the string of the value (since otherwise it’s a list item and the strip and split methods only work on strings). This gives you a dictionary with the folio as the dict key, and the value is now a nested list with each of the entries associated with its folio as a separate item, as you can see with folio 40’s four entries:
- Convert the document to a list, splitting each item at the ‘f. ‘ delimiter. Now you have a list with each folio as a separate item.
That’s pretty much it. Now you have a structure for your text. Congratulations, your text has become data, or data-ish at least. The resulting python dictionary allows you to search any folio and it will return a list of all the letters/entries on that folio. You can loop through all those entries and perform some function on/with them. So that’s a good thing to “pickle”, i.e. write it to a binary file, so that it can be easily read back as a python dictionary later on.
Once you have your data structured, and maybe add some more metadata to it, you can do all sorts of analysis with all of Python’s statistical, NLP, and visualization modules.
But if you are still straddling the Devonthink-Python divide, like I am, then you’ll also want to make these parsed bits available in Devonthink. Add a bit of code to write out each dictionary key-value pair to a separate file, and you end up with several hundreds of files:
Each file will have only the content for that specific entry, making it easy to precisely target your search and keywording. The last thing you want to do is cycle through several dozen hits in a long document for that one hit you’re actually looking for.
The beauty is that you can add more to the code – try extracting the dates and camps, change what information you want to include in the filename, etc. Depending on the structure of the data you’re using, you might need to nest dictionaries or lists several layers deep, as discussed in my AHA example. But that’s the basics. Pretty easy, once you figure it out, that is.
Even better: now you can run the same code, with a few minor tweaks, on all of those other collections of letters and campaign journals that you have, allowing you to combine Newhailes’ entries with Deane’s and Millner’s and Marlborough’s letters and… The world’s your oyster. But, like any oyster, it takes a little work opening that sucker. Not that I like oysters.
“Shaving the yak” is a phrase used to describe the process of programming. It alludes to the fact that you often have to take two, or more, steps backward in order to eventually move one step forward. You want a sweater, so first you need to get some yarn, but to do that you have to… and eventually you find yourself shaving a yak. The reason why you even consider shaving a yak is that, once you’ve shaved said yak, you now have lots of yarn, which allows you to make many sweaters. This colorful analogy has a surprising number of online images, and even an O’Reilly book. It’s a thing.
I have been doing a lot of digital yak-shaving over the past four months. Come to think of it, most of my blog posts consist of yak shaving.
So if you’re interested in learning to code with Python but not sure whether it’s worth it, or if you just want to read an overview of how I used Python and QGIS to create a map like this from a big Word document, then continue reading.
At least until the lights (or internet) goes down.
I’m preparing my appeal to you faithful skulkers to assist me in my quixotic quest to create a more robust and usable dataset on early modern European wars. I envision keeping it simple, at least at the start, posting a series of spreadsheets (possibly on Google Sheets) with information about various aspects of early modern warfare. We don’t want to start from scratch, so I’ve downloaded the basic information on the period’s wars and combats (“battles”) from Wikipedia, via Wikidata queries using SPARQL. And I’ve been learning about graph databases in the process, which someone might consider a bonus.
Wikipedia??? Well, the way I see it, they’ve already entered in a lot of basic information, and many of the factual details are probably correct, at least to a first order approximation. So it should speed up the process and allow us to refine and play around with the beta data (say that fast three times) before it’s “complete,” however that’s defined.
Once the data sheets are up online, we can clean that information, I can collate it, and then we can open it to the world to play with – analyze, map, chart, combine with other data, whatever one’s heart desires. If someone wants to deal with the Wikipedia bureaucracy, they can try to inject it back into The Source of All Knowledge.
In the meantime, if you’re curious as to what someone with some programming skills and an efficiency-oriented mindset can create, you should check out the following blog post, wherein a data scientist collects all of the wars listed in Wikipedia (Ancient to recent), and then explores their durations and a few other attributes. Very cool stuff, and you gotta love the graphics. Check it out at https://www.gokhan.io/post/scraping-wikipedia/. And just imagine what one could do with more granular data, and possibly more accurate data as well! Hopefully we’ll find out.
In the meantime, here’s a real simple map from a SPARQL query locating all of the “battles” listed in Wikidata (that have location information).
I’ll let you decide whether Europe and the eastern US really were that much more belligerent than the rest of the world. To the Methodology!
Where I’m at now, after reading more on GIS, historical and Quantum. Here we have the beginnings of my Low Countries theater map, for operational military history.
Features include rivers, the (modern) coastline, capital cities, fortifications (fortresses and forts) by side of garrison, a light tracing of the pré carré fortresses in northern France, and, for kicks, the woods of northern Belgium traced from the Austrian Ferraris maps, c. 1770s.
And more to trace, e.g. from the Pelet 1837 atlas:
Still lots of work to do, cleaning things up and adding additional features, like army marches and camps. Eventually, I’ll even work up to Print Composer and stop taking screenshots.
But in the meantime, progress moves forward.
So let’s say you’ve become obsessed with GIS (geographical information systems). And let’s also posit that you’re at a teaching institution, where you rotate teaching your twelve different courses plus senior seminars (three to four sections per semester) over multiple years, which makes it difficult to remember the ins-and-out of all those historical narratives of European history from the 14th century (the Crusades, actually) up through Napoleon – let’s ignore the Western Civ since 1500 courses for now. And let’s further grant that you are particularly interested in early modern European military history, yet can only teach it every other year or so.
So what’s our hypothetical professor at a regional, undergraduate, public university to do? How can this professor possibly try to keep these various periods, places and topics straight, without burdening his (errr, I mean “one’s”) students with one damned fact after another? How to keep the view of the forest in mind, without getting lost among the tree trunks? More selfishly, how can one avoid spending way too much prep time rereading the same narrative accounts every few years?
Why, visualize, of course! I’ve posted various examples before (check out the graphics tag), but now that GIS makes large-scale mapping feasible (trust me, you don’t want to manually place every feature on a map in Adobe Illustrator), things are starting to fall in place. And, in the process, I – oops, I mean our hypothetical professor – ends up wondering what historical research should look like going forward, and what we should be teaching our students.
I’ll break my thoughts into two posts: first, the gritty details of mapping the Italian Wars in GIS (QGIS, to be precise); and then a second post on collecting the data for all this.
So let’s start with the eye-candy first – and focus our attention on a subject just covered in my European Warfare class: the Italian Wars of the early 16th century (aka Wars of Italy). I’ve already posted my souped-up timechart of the Italian Wars, but just to be redundant:
That’s great and all, but it really requires you to already have the geography in your head. And, I suppose, even to know what all those little icons mean.
Maps, though, actually show the space, and by extension the spatial relationships. If you use PowerPoint or other slides in your classes, hopefully you’re not reduced to re-using a map you’d digitized in AutoCAD twenty years earlier, covering a few centuries in the future:
Instead, you’ve undoubtedly found pre-made maps of the period/place online – either from textbooks, or from other historian’s works – Google Images is your friend. You could incorporate raster maps that you happen across:
Maybe you found some decent maps with more political detail:
Maybe you are lucky enough that part of your subject matter has been deemed important enough to merit its own custom map, like this digitized version of that old West Point historical atlas:
If you’re a bit more digitally-focused, you probably noticed a while back that Wikipedia editors have started posting vector-based maps, allowing you to open them in a program like Adobe Illustrator and then modify them yourself, choosing different fills and line styles, maybe even adding a few new features:
Now we’re getting somewhere!
But, ultimately, you realize that you really want to be your own boss. And you have far more questions than what your bare-bones map(s) can answer. Don’t get me wrong – you certainly appreciate those historical atlases that illustrate Renaissance Italy in its myriad economic, cultural and political aspects. And you also appreciate the potential of the vector-based (Adobe Illustrator) approach, which allows you to add symbols and styling of your own. You can even search for text labels. Yet they’re just not enough. Because you’re stuck with that map’s projection. Maybe you’re stuck with a map in a foreign language – ok for you, but maybe a bit confusing for your students. And what if you want to remove distracting features from a pre-existing map? What if you care about what happened after Charles VIII occupied Naples in early 1495? What if you want to significantly alter the drawn borders, or add new features? What if you want to add a LOT of new features? There are no geospatial coordinates in the vector maps that would allow you to accurately draw Charles VIII’s 1494-95 march down to Naples, except by scanning in another map with the route, twisting the image to match the vector map’s boundaries, and then eye-balling it. Or what if you want to locate where all of the sieges occurred, the dozens of sieges? You could, as some have done, add some basic features to Google Maps or Google Earth Pro, but you’re still stuck with the basemap provided, and, importantly, Google’s (or Microsoft’s, or whoever’s) willingness to continue their service in its current, open, form. The Graveyard of Digital History, so very young!, is already littered with great online tools that were born and then either died within a few short years, or slowly became obsolete and unusable as internet technology passed them by. Among those online tools that survive for more than a five years, they often do so by transforming into a proprietary, fee-based service, or get swallowed up by one of the big boys. And what if you want to conduct actual spatial analysis, looking for geospatial patterns among your data? Enter GIS.
So here’s my first draft of a map visualizing the major military operations in the Italian peninsula during the Italian Wars. Or, more accurately, locating and classifying (some of) the major combat operations from 1494 to 1530:
Pretty cool, if you ask me. And it’s just the beginning.
How did I do it? Well, the sausage-making process is a lot uglier than the final product. But we must have sausage. Henry V made the connection between war and sausage quite clear: “War without fire is like sausages without mustard.”
So to the technical details, for those who already understand the basics of GIS (QGIS in this case). If you don’t know anything about GIS, there are one or two websites on the subject.
- I’m using Euratlas‘ 1500 boundaries shapefile, but I had to modify some of the owner attributes and alter the boundaries back to 1494, since things can change quickly, even in History. In 1500, the year Euratlas choose to trace the historical boundaries, France was technically ruling Milan and Naples. But, if you know your History, you know that this was a very recent change, and you also know that it didn’t last long, as Spain would come to dominate the peninsula sooner rather than later. So that requires some work fixing the boundaries to start at the beginning of the war in 1494. I should probably have shifted the borders from 1500 back to 1494 using a different technique (ideally in a SpatiaLite database where you could relate the sovereign_state table to the 2nd_level_divisions table), but I ended up doing it manually: merging some polygons, splitting other multi-polygons into single polygons, modifying existing polygons, and clipping yet other polygons. Unfortunately, these boundaries changed often enough that I foresee a lot of polygon modifications in my future…
- Notice my rotation of the Italian boot to a reclining angle – gotta mess with people’s conventional expectations. (Still haven’t played around with Print Composer yet, which would allow me to add a compass rose.) More important than being a cool rebel who blows people’s cartographic preconceptions, I think this non-standard orientation offers a couple of advantages. First, it allows you to zoom in a bit more, to fit the length of the boot along the width rather than height of the page. More subtly, it also reminds the reader that the Po river drains ‘down’ through Venice into the Adriatic. I’m sure I’m not the only one who has to explicitly remind myself that all those northern European rivers aren’t really flowing uphill into the Baltic. (You’re on you own to remember that the Tiber flows down into the Tyrrhenian Sea.) George “Mr. Metaphor” Lakoff would be proud.
- I converted all the layers to the Albers equal-area conic projection centered on Europe, for valid area calculations. In case you don’t know what I’m talking about, I’ll zoom out, and add graticules and Tissot’s indicatrices, which illustrate the nature of the projection’s distortions of shape, area and distance as you move away from the European center (i.e. the main focus of the projection):
And in case you wanted my opinion, projections are really annoying to work with. But there’s still room for improvement here: if I could get SpatiaLite to work in QGIS (damn shapefiles saved as SpatiaLite layers won’t retain the geometry), I would be able to re-project layers on the fly with a SQL statement, rather than saving them as separate shapefiles.
- I’m still playing around with symbology, so I went with basic shape+color symbols to distinguish battles from sieges (rule-based styling). I did a little bit of customization with the labels – offsetting the labels and adding a shadow for greater contrast. Still plenty of room for improvement here, including figuring out how to make my timechart symbols (created in Illustrator) look good in QGIS.
After discovering the battle site symbol in the tourist folder of custom markers, it could look like this, if you have it randomly-color the major states, and include the 100 French battles that David Potter mentions in his Renaissance France at War, Appendix 1, plus the major combats of the Italian Wars and Valois-Habsburg Wars listed in Wikipedia:
Boy, there were a lot of battles in Milan and Venice, though I’d guess Potter’s appendix probably includes smaller combats involving hundreds of men. Haven’t had time to check.
- I used Euratlas’ topography layers, 200m, 500m, 1000m, 2000m, and 3500m of elevation, rather than use Natural Earth’s 1:10m raster geotiff (an image file with georeferenced coordinates). I wasn’t able to properly merge them onto a single layer (so I could do a proper categorical color ramp), so I grouped the separate layers together. For the mountain elevations I used the colors in a five-step yellow-to-red color ramp suggested by ColorBrewer 2.0.
- I saved the styles of some of the layers, e.g. the topo layer colors and combat symbols, as qml files, so I can easily apply them elsewhere if I have to make changes or start over.
- You can also illustrate the alliances for each year, or when they change, whichever happens more frequently – assuming you have the time to plot all those crazy Italian machinations. If you make them semi-transparent and turn several years’ alliances on at the same time, their overlap with allow you to see which countries switched sides (I’m looking at you, Florence and Rome), vs. which were consistent:
- Plotting the march routes is also a work in progress, starting by importing the camps as geocoded points, and then using the Points2One plugin to connect them up. With this version of Charles’ march down to Naples (did you catch that south-as-down metaphor?), I only had a few camps to mark, so the routes are direct lines, which means they might display as crossing water. More waypoints will fix that, though it’d be better if you could make the march routes follow roads, assuming they did. Which, needless to say, would require a road layer.
- Not to mention applying spatial analysis to the results. And animation. And…
More to come, including the exciting, wild world of data collection.
At the end of 2017, I’m able to catch my breath and reflect back on the past year. It was a digital year, among other things.
Most concretely, our History department’s Digital History Lab was finally completed. Two long years of planning and grant-writing, and almost 800 emails later, my quixotic labor of love is (almost) done! A generous anonymous donor gave us enough money to find a room one floor above our offices, and to find the money to stock it with PCs and iMacs, a Surface Hub touch-display, scanners (including a microfilm scanner and a ScannX book scanner), and a Surface Book tablet/laptop to pass around the seminar table and project to the Surface Hub. These tools will allow our undergraduate department to use the lab for a variety of projects: digital-centric history courses and digitally-inflected courses; independent studies and tutoring; faculty projects and internships; as well as public history projects with local museums. Not to mention the Skype-enabled Hub.
In the process of designing and overseeing the lab’s construction, I’ve learned a lot about institutional paranoia and the rules they necessitate, and how the digital humanities’ love of open-source software doesn’t play well with IT’s need for locked-down systems. So the lab had to forego many of the open-source tools used by digital historians and humanists. But I did try to provide the computers in the lab with commercial programs with similar features. The software includes:
- ABBYY FineReader for OCRing texts
- the standard Microsoft Office suite (including Access for relational databases)
- the standard Adobe Creative Suite, including Illustrator
- statistics software (SPSS and Minitab)
- EndNote (because we can’t install Zotero)
- Aeon 2 timeline software (for semi-interactive timelines like this)
- mapping software, including Google Earth Pro, ArcGIS, QGIS, Centennia and Euratlas historical digital maps, and MAPublisher to tweak geospatial data in Illustrator.
- OutWit Hub for web scraping and tagged entity extraction
- online software, such as Google Fusion Tables, Palladio, Voyant, etc.
- the machines also have Python, but I’m not sure about how easy it will be to constantly install/update new libraries and the like, given the school’s security concerns
- the department also has a subscription to Omeka, for our planned public history projects.
And there’s more to come. The anonymous donor made an additional donation which will allow us to replace that retro chalkboard with a 90″ monitor display. As well as purchase a few other software packages, and even a reference book or two. All the tools you need to do some digital history. And build a digital history curriculum for our undergraduate majors.
The DHL will be the centerpiece of our department’s new foray into digital history. Since we’re an undergraduate institution, our goals are modest. Having just taught the first iteration of my Introduction to Digital History course, it’s pretty clear that having undergraduates mess with lots of open-source package installations – much less try to learn a programming language like Python – would’ve been a nightmare (especially since I’m just learning Python myself). So our textbook, Exploring Big Historical Data, didn’t get as much use as I’d initially planned. But we did spend some time looking at the broader picture before we dove into the weeds.
And to make sure the students understood the importance of kaizen and the “There’s gotta be a better way!!!” ethic, I beat them over the head with the automation staircase:
As a result, the students were introduced to, and hopefully even learned how to use at least a few features of, the following tools:
- Adobe Acrobat automation
- Excel (don’t assume today’s college students know how to use computers beyond games and social media)
- MS Access
- OCR (ABBYY FineReader and Adobe Acrobat Pro)
- Regular expressions
- Google Sheets and ezGeocode add-in
- Google Fusion Tables
- Stanford Named Entity Recognition
- OutWit Hub
A digital smorgasbord, I realize, but I tried to give them a sampling of relational databases, text mining, and mapping. Unfortunately, we proved again and again that 60%-80% of every digital project is acquiring and cleaning the data, which meant there wasn’t as much time for analysis as I would’ve liked. And, to boot, several of the tools were extremely limited without purchasing the full version (OutWit Hub), or installing the local server version on your own computer (Stanford NER) – did I mention students had problems installing software on their own machines? But, at the least, the students were exposed to these tools, saw what they can do, and know where to look to explore further, as their interests and needs dictate. I’d call that an Introduction to Digital History.
Fortunately, I was able to play around with a few more sophisticated tools in the process, relying on the Programming Historian, among other resources:
- Vard 2 and GATE (cleaning up OCRed texts)
- MALLET topic modeling
- Gephi network software (Palladio also has some basic network graphing features)
- VOS Viewer for bibliometrics – if only JSTOR/Academic Search Premier/Historical Abstracts had the bibliometric citation datasets that Web of Science does (yes, JSTOR’s Text Analyzer is a start, but still…)
- Edinburgh geoparser
- Python (also with the help of Automating the Boring Stuff with Python).
So now I’ve at least successfully used most of the tools I see digital historians mention, and have established a foundation to build future work upon.
So, what are my resolutions for 2018?
More of the same, but applied toward EMEMH!
More digitalia – adding a few more toys to Eastern’s Digital History Lab, training the other History faculty on some of its tools (Zotero and Omeka, for starters), and practicing a bit more with GIS. And figuring out a way to efficiently clean all those 18C primary source texts I’ve got in PDFs. And, just as mind numbing, creating shapefiles of the boundaries of early modern European states.
More miltaria – I’m teaching my European Warfare, 1337-1815 course again this Spring, and will try to figure out a way to have the students’ projects contribute towards an EMEMH dataset that will eventually go online.
And did I mention a year-long sabbatical in 2018-19, so I can finish the big book of battles, and start the next project, a GIS-driven operational analysis of Louis XIV’s campaigns? Yeehaa!
So here’s to wishing your 2018 might be a bit more digital too.