Where the historians are, 2017
“Shaving the yak” is a phrase used to describe the process of programming. It alludes to the fact that you often have to take two, or more, steps backward in order to eventually move one step forward. You want a sweater, so first you need to get some yarn, but to do that you have to… and eventually you find yourself shaving a yak. The reason why you even consider shaving a yak is that, once you’ve shaved said yak, you now have lots of yarn, which allows you to make many sweaters. This colorful analogy has inspired a surprising number of online images, and even an O’Reilly book. It’s a thing.
I have been doing a lot of digital yak-shaving over the past four months. Come to think of it, most of my blog posts consist of yak shaving.
So if you’re interested in learning to code with Python but not sure whether it’s worth it, or if you just want to read an overview of how I used Python and QGIS to create a map like this from a big Word document, then continue reading.
Taking Advantage of Sabbatical
On a meta level, I knew that if I were ever to make any sweaters with computer code, I would have to shave that particular yak this sabbatical. Multiple factors converged:
- First, this year ‘off’ would be my one opportunity in the next seven years to delve into Python and to learn whatever else would set up my research and (digital history) teaching. Several years ago, I remember reading a digital historian’s blog post on the cool stuff he was doing with some advanced digital tool, and I thought, “Yeah, but who has time to do all that?”
[Thumbs pointing at self]: “This guy.”
Admittedly, I have Marlborough’s Big Book of Battles (working title) to finish, but some of the coding I learn can help with that. Ultimately, it’s about priorities, and, honestly, the world will not end if it’s denied one more book on Marlborough within the next year. And the book will be a lot better with the Python tools I’m learning.
- Second, and fortuitously, beginner-friendly Python has arrived, literally within the past few years. Thanks to Anaconda, Jupyter notebooks, oodles of websites (including the programminghistorian.org), dozens of books, and dozens of YouTube tutorials from recent PyCon, PyData, PyLondon, PyBerlin… conferences, there is a critical mass, and you can learn much of it on your own, even if you don’t take any of the available online courses.
Don’t get me wrong: learning Python has still been challenging – the most frustrating part is getting everything set up, whether it’s installing the right Python version in the right directory (tip: start with a clean install of Python 3 using Anaconda), installing third-party Python libraries visible to your Anaconda installation (tip: do it from the command line and activate the conda environment first), or getting your data into a usable format for analysis (see below). It also requires a learning process to move from the basic tasks you can perform with a Jupyter tutorial downloaded from GitHub (or from a website or book), to more realistic, and therefore more complicated, customized tasks that you really want to perform with your data, right now. I wouldn’t have been able to do much of what I wanted in Python, certainly not within a few months of beginning to learn it, without the help of my programming wife and a Python-literate colleague in Eastern’s English Department (Ben Pauley). So there’s definitely a learning curve.
- Third, Python has become the go-to language for text cleaning, natural language processing, visualizations (along with R), and, increasingly, basic machine learning. And did I mention it also has mapping libraries like geopandas? Python will do practically any academic gruntwork a humanist can imagine computers doing, and then some. And I say that as a humanist with a bit of an imagination.
- Fourth, having taught my Intro to Digital History course once already, I learned that online tools are fleeting and fragile, and will only do a third of what you want them to do. You can usually find small, niche programs that give you the ability to do another quarter of what you want: things like VARD 2 and GATE (cleaning OCRed text) and GRAMPS (genealogy) and OutWit Hub (web scraping) and Stanford NER (named entity recognition of text) and Edinburgh Geoparser (NER/mapping). They can be very useful, but they can also become outdated (especially the free online ones), and you may well have as much trouble installing them on your local machine as something like Python. So given that Python will do almost everything just about every other dedicated software package will do (again, I’m talking data science and academic tasks here, and it will require programming on your part), and since everything in Python is free, and since there are so many Python libraries that will perform most of these tasks, why struggle to install a dozen different programs and learn each of their quirks, just to do one specific thing in each? One program to scrape data from a website. Another program for cleaning data. One for doing quantitative analysis. Another for qualitative analysis or natural language processing. Another for visualizing your results in a fancy chart (Excel does not count). Another for creating a network graph of your data. Yet another program to map your data. Still another to create an interactive visualization that you can explore… Python can do them all. Don’t get me wrong: sometimes it will actually be easier to just install a specialized program. But you’ll only know after you’ve tried to recreate part of it in Python.
So, after playing around with most of the other programs, I decided to focus my struggle on installing and learning one tool (Python), and then use its hundreds of libraries to help me do any number of analyses. Python’s a pretty big yak, but there’s a lot of multicolored yarn on that beast. And, you can always rely on your text editors, Excel, and other niche programs to fill in any gaps, until you learn more about Python and its libraries.
Thus: means + opportunity + motive = learn Python. I already have plans for a few dozen projects for Python to automate, everything from simple time-savers like calendar look-up (“Last Tuesday we marched…” – what date was last Tuesday?) to analyzing my prose to analyzing primary (or secondary) sources to semi-automating the creation of a book index, to the topic of this post: a map of US History departments. And that list doesn’t even include various service tasks I’ll need to perform once I become department chair next year.
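That calendar look-up, for instance, takes only a few lines with Python’s built-in datetime module. A minimal sketch (the function name and the sample date are my own):

```python
from datetime import date, timedelta

def last_weekday(ref: date, weekday: int) -> date:
    """Date of the most recent given weekday strictly before ref.
    weekday uses datetime's convention: Monday=0 ... Sunday=6."""
    days_back = (ref.weekday() - weekday - 1) % 7 + 1
    return ref - timedelta(days=days_back)

# A source dated Friday, 20 April 2018 mentions "last Tuesday":
print(last_weekday(date(2018, 4, 20), 1))  # -> 2018-04-17
```

(For old-style dates you would still have to handle the Julian/Gregorian conversion yourself, which datetime won’t do for you.)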
But back to maps. Last month I read a recent article by John D. Hosler, “Pre-Modern Military History in American Doctoral Programs: Figures and Implications” in the April 2018 issue of Journal of Military History. In it, he argues that there is a dearth of US doctoral programs that teach medieval military history. I was curious about this, and I was looking for another dataset to play around with in Python. Since I’m still in the advanced beginner stage of Python, I used this cartographical sweater project to force myself to learn some basics of Python text parsing. Sabbatical allows you to give that yak a closer shave than you could during a regular school year. (Ok, that metaphor is beginning to sound a little weird now…)
The steps I took illustrate the frequent walking-backwards-in-order-to-move-forward process that is shaving the figurative yak:
- After reading Hosler’s piece, I thought to myself, “Hmm, that sounds like early modern European military history, but EMEMH is probably worse off.”
- Then thought to myself, “But I don’t want to replicate his work for EMEMH. But maybe I’ll just map his schools! That should be straightforward enough!” Famous last words.
- Asked John Hosler for his data. He kindly obliged.
- Realized his data isn’t in a very computer-friendly format. The most complete dataset was published in the original article:
It looks fine on the printed page, but it’s problematic for reuse. It’s a textual table, so it’s not easy to convert to csv; multiple schools are in each cell rather than a separate row for each school; the important information – which school has a medievalist – is indicated by formatting (bold) rather than as a separate column with a yes/no value, and Excel has a tough time sorting/filtering on bold formatting; finally, there are abbreviations which are quite understandable to humans, such as “IU-Bloomington”, but these are not their standard names, which means it wouldn’t be easy to match them to another list of schools algorithmically.
Lesson learned? Historians (practically all of us) are horrible at preserving our data, and few of us have been trained in how to present data that can be easily digested by computers. Nowadays, people refer to that as “tidy data” (pdf link here).
- So I decided that, rather than enter it all by hand, the best way to find a list of all the US history programs would be to check the AHA website, the AHA being the flagship organization for American historians. I discovered that the data isn’t, in fact, available online (being migrated). I emailed the person in charge and asked for a download of the dataset. Was told that the data isn’t in a very usable state now and can’t really be extracted from the database (but they’d think about maybe making it downloadable in the future). I was, however, generously given the next best thing – a Word document of the AHA’s 2017 department directory, which includes all sorts of self-reported info on each department that pays to have its info included, about 600 departments in total.
Lesson relearned for the umpteenth time? Historians don’t think in ‘dataset’ terms, and we’re not very good at constructing and managing them. But we do like to share, which is a start.
- Looking over the Word file, I realized that the 361,000-word text document wasn’t particularly usable in its current form, at least for much more than looking up a person or school. But it does have lots of data, and most of that data is semi-structured:
- So I spent some time learning the Python that would let me import that structured data into a Python data object. I figured out that I need a Python dictionary (school as dictionary key and info on school as dictionary value), and that I could use regex to parse the data, though, again, bold fonts don’t help much.
- First step was to convert it from Word to plain text, which could be easily imported into Python. But before that, I cheated and used my knowledge of Word’s advanced find-replace based off formatting (school names were in a larger font size) to add a delimiter character at the beginning of each school name – that would make it easier to separate out the schools once in Python.
- Once I started trying to organize the data in Python, I realized that I actually needed a nested dictionary in Python. So then I spent time (ok, my wife’s time) figuring out how to import items at different layers into nested dictionaries, with nested lists within the nested dictionary values. This is where it got complicated, but we figured it out after several hours over a couple of days. Then I could expand from the simplest test case to the several variables I was interested in, using regex as needed. Part of the code looks like this:
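In outline, that parsing stage works something like the following sketch. This is a hedged reconstruction, not my actual notebook: the `@@` delimiter, field names, and sample entries are my own placeholders, not the AHA’s actual format.

```python
import re

# Toy stand-in for the converted directory text: each school's entry
# was prefixed with a delimiter ("@@" here) back in Word
raw = """@@Example State University
Degrees offered: BA, MA, PhD
Full-time faculty: 24
@@Another College
Degrees offered: BA
Full-time faculty: 9
"""

schools = {}
for entry in raw.split("@@")[1:]:          # first chunk before "@@" is empty
    lines = entry.strip().splitlines()
    name = lines[0].strip()                # first line of each entry = school name
    info = {}
    for line in lines[1:]:
        m = re.match(r"(.+?):\s*(.+)", line)   # "Field: value" lines
        if m:
            # comma-separated values become nested lists inside the dict
            info[m.group(1)] = [v.strip() for v in m.group(2).split(",")]
    schools[name] = info

print(schools["Example State University"]["Degrees offered"])
# -> ['BA', 'MA', 'PhD']
```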
- I spent time learning how to clean the data further – schools with multiple values in a field, converting lists to numbers, etc. Real data is real messy.
- I then read the nested dictionary into Python’s pandas library. Then I did more cleaning.
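Getting a nested dictionary into pandas is mostly a matter of knowing about the `orient` parameter of `DataFrame.from_dict`. A sketch with made-up schools:

```python
import pandas as pd

# Hypothetical nested dictionary of the kind parsed from the directory
schools = {
    "Example State University": {"Degrees offered": ["BA", "MA", "PhD"],
                                 "Full-time faculty": 24},
    "Another College": {"Degrees offered": ["BA"],
                        "Full-time faculty": 9},
}

# orient="index" makes each school a row and each inner key a column
df = pd.DataFrame.from_dict(schools, orient="index")
print(df)
```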
- But if I want to map these, I’ll need their coordinates. So I geocoded the list of schools to get their respective latitude and longitude coordinates. This can be done in Python, but I used Google Sheets’ ezGeocoder because I was more familiar with it.
- Then I combined those coordinates with the other pandas data. I haven’t perfected this concatenation yet in Python, but even with some extraneous rows to clean, it was still faster than doing it by hand for 600 schools.
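The combining step itself is pandas’ `merge`. A sketch with made-up coordinates (not my actual data); a left merge keeps every school, even ones the geocoder missed:

```python
import pandas as pd

# Department data and geocoded coordinates, both keyed by school name
depts = pd.DataFrame({"school": ["Example State University", "Another College"],
                      "phd": [True, False]})
coords = pd.DataFrame({"school": ["Example State University", "Another College"],
                       "lat": [41.8, 40.3], "lon": [-72.5, -74.0]})

# how="left" keeps every department row, with NaN where geocoding failed
merged = depts.merge(coords, on="school", how="left")
print(merged[["school", "lat", "lon"]])
```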
- After it was pretty clean, I exported the resulting pandas table to Excel, to finish off the cleaning (haven’t yet figured out how to convert a list in a pandas cell into a numeric).
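For what it’s worth, one way to get a numeric out of a list-valued pandas cell is to take the list’s length – `.str.len()` works on lists as well as strings. A sketch (not necessarily what my notebook ended up doing):

```python
import pandas as pd

# A column whose cells hold lists, e.g. degrees offered per school
df = pd.DataFrame({"school": ["Example State U", "Another College"],
                   "degrees": [["BA", "MA", "PhD"], ["BA"]]})

# .str.len() returns the length of each cell's list as a number
df["n_degrees"] = df["degrees"].str.len()
print(df["n_degrees"].tolist())  # -> [3, 1]
```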
- Saving it to csv, I then imported the data into QGIS – haven’t had time yet to explore Python’s geopandas library. So I mapped some of the data in QGIS.
- In the process, I realized that I needed to add another field to the dataset, which took me back to Python, to add another field to parse, and then clean, and then export it all again to Excel, and then to QGIS. Note that this process is easy with Jupyter notebook’s pipeline – all you have to do is add the extra bit of code and then run the tweaked code again on the original dataset. It will redo all the importing and cleaning and exporting automatically – just make sure you’re not overwriting any cleaning you did in Excel after pandas! The revision process would be even easier if you eliminated steps 12-15 above, by automating the geocoding and final cleaning procedure in Python and mapping it in geopandas. (Though QGIS will give you more customizability.)
So that’s how I got to the map shown above. In reality, it took more than just a few weeks to make the map (I was doing other things, you know). More to the point, I’d already spent a few decades looking at maps and being a “power user” of computers, spent a few semesters taking cartography and statistics courses in grad school, and spent weeks learning QGIS as well.
Is learning Python a lot of work? Yes, depending on what you want it to do.
Is it worth it? Well, I guess that depends on how badly you want to map data. And analyze data. And chart data. And get new data to map and analyze and chart… In the past, historians could get away with prose alone, and maybe the occasional hand-drawn map. But, if I put on my prediction hat, I think more and more historians will not only see how powerful these tools can be, but will also realize that their arguments will increasingly be tested by historians who bring more data to the party, and who use digital tools that allow them to be more consistent and to look at more data than is possible with eye and hand alone. And, as more primary sources become available in digital form, and as more unstructured text becomes readable by machines, there will be less of an excuse not to use digital tools. I know, I know, all this has been predicted before, back in the 60s. But this time it may really be different. Natural language processing, the ability to extract information from masses of digitized text, might well be the difference.
So if you are still code-curious after all the above, I present you with general thoughts I’ve learned over the past four months, many drawn from other guides on Python. And then, more maps.
My “tips on learning a first programming language, by someone who’s barely learned Python” include:
- The easy Python code is readily available in books, online, and in Jupyter notebooks you can freely download. If you want to do basic stuff, it’s not that hard. Unfortunately, you’ll probably not be particularly interested in the basic stuff. But you should be patient and start with the baby steps. I wasn’t patient at first, but I ended up having to take those baby steps all the same. Baby steps include understanding the different object types (like strings, integers, lists, dictionaries), understanding the basic Python syntax (such as common abbreviations and what they stand for), and some computer concepts like methods and arguments.
- There’s a huge difference between thinking you understand what somebody’s finished code is doing, and creating comparable code yourself from scratch.
- Code will, almost always, tell you if it doesn’t work. (Except for regex, which can “fail silently.”) And when it fails, it’s probably doing exactly what you told it to do, not necessarily what you wanted it to do. As a result, programmers talk about the process of “failing to success” – i.e., each error message brings you closer to code that works. It’s humbling and sometimes frustrating, but at least the computer tells you if you did something wrong. Doing things by hand, even with Excel, rarely gives us that safety net.
- Consult different resources. There are many books, websites, blogs and videos that teach specific Python features and libraries. But some are better than others, and some will discuss techniques you are more likely to use. So poke around. Once you start to feel a little more comfortable with the basics, then look at the online documentation for Python and its various libraries. Those pages will let you know what exactly you can do with each method, what ‘arguments’ and parameters are available.
- If you really want to understand code and modify it to your own purposes, you need to do it the hard way. Which means learn by typing the code out yourself. That’s the only way it will stick.
- Like everything else, it gets easier the more you do it. They call it a learning curve for a reason, because the slope gets less steep at a certain point. I think I’m starting to see that flattening curve ahead of me. But that required going back and rereading chunks of some of the introductory chapters more than once, whenever I’d get stuck on an intermediate-level task.
- You will undoubtedly get frustrated when you find some sample code that probably does what you want, but it starts with a different type of object. This is why it’s important that you understand the basics of the language, e.g. the different types of objects, so that you can modify the sample code to fit your specs. So spend some time early on learning how to read different types of files (a text file, a Word doc, a PDF, a csv file…) into Python as a string or list or dictionary or what-have-you. Even better, realize that you’ll ideally read most of your data in as either a csv or txt file, excluding images and sound, of course. So figure out an easy way to convert all your other files to one of those two formats. Tragically, the two most commonly-used file formats in the humanities, Word documents and PDF files, are the worst when it comes to readability by other software. So do what I did with the AHA Word doc – convert it to txt. Ditto for Excel files – convert to csv.
- If you deal with text files, keep an eye out for encoding issues. If you see weird gobbledygook in your text, strange characters with slashes, upside-down question marks and what-not, that probably means you have an encoding issue. Encoding is its own universe, but the best advice is to always save your files (csv, txt) in your text editor (NOT Word) as plain UTF-8 encoding. Do not use BOM, do not use UTF-16 or Western, do not use anything else. And don’t assume that just because the file was at one point in UTF-8, it’s still in UTF-8, particularly if you’re switching between Excel and a text file. Saving the original file in UTF-8 is the easiest way to get usable data into Python, including text in non-Latin alphabets.
- When you get stuck trying to do something in Python, you can rotate your Python projects: get stuck on one project, move to another until you get stuck on that, and then on to another. You’ll probably finish at least a couple of those, and you’ll likely learn things that can then be applied back to a project you were previously stuck on. And there’s always a Python community (including on Stack Overflow) for the harder stuff. Hopefully I’ll post the Jupyter notebooks on my GitHub site as I complete them.
- Python allows you to rerun your code on the original data every time you make a change to either the code or the data. You can ‘show your work’ – versus the response that I’ve received from two EMEM historians over the past couple decades when requesting the data their summary tables were based on: “Don’t have it anymore.” Certainly not at the level of a certain book on American gun culture, but still, we should do better.
That’s the gold standard of replicability and transparency: a) you, or others, can rerun your old code on your old data and get the same results you did previously; b) you, or others, can run your old code on new data and get results compatible with your old methodology, or maybe discover that you need to redo a) above; c) you, or others, can run new code on your old data and update your results; and d) you, or others, can run new code on new data, for new results. Consistently. Easily.
- You can reuse chunks of your code in other projects, without having to reinvent the wheel. That includes sharing it with other people. At first, the hardest part of repurposing someone else’s code will be figuring out how to use your data with their code. Once you figure that out, things get easier.
- Some of the Python libraries are built off each other, using similar syntax. This is particularly true for pandas, a Python equivalent to Excel, or maybe even Access. So, at some point, be sure to learn the foundational packages like matplotlib and NLTK and pandas.
- Newer Python libraries in the same domain tend to make it easier to do the same things, and add additional features. For example, matplotlib (graphing) and NLTK (text processing) are older libraries that are ultra-customizable, and therefore can have some complicated syntax. But there are newer libraries, like seaborn or textblob/spaCy, that make it easier for you to do the same thing with simpler syntax.
- But if you find yourself with a deadline, or feel like you’re spending more time on a Python project than is necessary, don’t be afraid to revert to a better-known method or tool. But be sure to do a quick online search to make sure someone hasn’t already invented the wheel for you. In the case of Hosler’s data in Table 1, it was easiest to just manually enter those 51 schools with a faculty member in medieval history. Cleaning data can be time-consuming.
- The more you explore Python’s features, the more you’ll start seeing connections to other things, to other potential projects. And the more you’ll understand how packaged software does the things it does, because now you can do it too, but in Python.
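Two of the pitfalls above in miniature – a regex that “fails silently,” and an explicit UTF-8 encoding (the file name here is my own placeholder):

```python
import re

# A regex never raises an error just because it found nothing: a typo'd
# pattern simply returns None, and the mistake surfaces later (or never)
m = re.search(r"Univeristy", "Example State University")  # note the typo
print(m)  # -> None, with no error message

# Explicit encodings when writing and reading text files avoid gobbledygook
with open("schools.txt", "w", encoding="utf-8") as f:
    f.write("École des Chartes\n")
with open("schools.txt", encoding="utf-8") as f:
    print(f.read().strip())  # -> École des Chartes
```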
So How About Some More Maps?
We’ll start with a boring map with one dot per History department. Using QGIS 2.18’s Print Composer, you can use the same layout and just change the details, giving each of the 602 schools an equally-sized point symbol.
Since the AHA directory had a ‘Degrees offered’ category that lists the type of degrees offered (BA, MA, PhD…), I used Python to extract each of those degrees into its own field and mapped them as well, using a rule-based style to only display History departments that offer a PhD in History, about 150 programs.
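That degree-extraction step amounts to a few lines of pandas. A sketch with made-up rows (the `\b` word boundaries keep a string like “MBA” from matching “BA”):

```python
import pandas as pd

# Hypothetical 'Degrees offered' strings, one per school
df = pd.DataFrame({"school": ["Example State U", "Another College"],
                   "degrees": ["BA, MA, PhD", "BA"]})

# One boolean column per degree, so QGIS can filter on e.g. PhD == True
for degree in ["BA", "MA", "PhD"]:
    df[degree] = df["degrees"].str.contains(rf"\b{degree}\b")

print(df[df["PhD"]]["school"].tolist())  # -> ['Example State U']
```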
Or, we could go back to the initial starting question – where are all of those medieval (military) historians teaching? Given my description of Hosler’s formatted data above, I had to add his information (school and degree offered in medieval) to the AHA dataset by hand. Fortunately, it wasn’t that big of a chore – only 52 schools, 39 of which offer Ph.D.s. Mapping that data results in the following:
Interesting. Initially, I had guessed they’d be grouped more in the southern US, but that doesn’t seem to be the case. One could explore further, looking at percentages by region and the like, but it does appear that the ‘interest’ (if that is even relevant to faculty lines) in pre-modern is more pronounced in the ‘older’ part of the US.
Even More Possibilities
Overall, the Python data extraction was a bit messier than I’d like, e.g. a few schools were missed in my initial passes, and some schools didn’t include all their data. One or two major programs didn’t even have their info in the AHA directory. You can probably see one or two other anomalies in the maps above. All of which should serve as a reminder: it’s always better to start with a data dump from a database rather than extract from text, if you can. But sometimes you can’t, so you make do.
Given all the structured information that’s in the AHA directory, you could look at all sorts of things if you were so inclined. And if you took the time to double-check your extracted data. A few examples of questions you might ask of this expanded dataset:
- Map by number of faculty, i.e. size of department.
- The distribution of the full-time faculty by rank in 2017, i.e. the percent at each rank of assistant, associate, full…
- Faculty areas of specialization. Are there generally-accepted names for these categories? Is there a pattern according to geographical location, or to the school where each faculty member got their PhD, or the year their PhDs were granted? Do departments have more than one specialist in a particular area? Which? Where? What period?
- Schools where the full-time faculty in each department got their PhDs from, as well as when they got their PhDs. Patterns? Maybe even make a network graph that would draw lines for a given school – where did all the faculty at school X get their PhDs from? Which schools are the main feeder schools (not that hard to guess), and have these changed over time? Do we see an overall change in schools based by year of PhD?
- If you had some way of ranking each school (e.g. by Carnegie classification or US News & World Report ranking), you could combine your dataset with that information for further analysis.
- Given Hosler’s question of where one would go for medieval (military) history, you could ask the same question of any area of specialization. It might even be useful to know whether medieval (military) history has more or fewer schools than other equivalent subspecialties. Maybe look at the listed specializations of specific faculty and see if their alma mater currently has a specialist in that area?
- All sorts of possibilities in the data. And we haven’t even mentioned combining this AHA data with other data – political proclivities of a state relative to its History faculty, and so on…
These would be interesting not so much for specific individuals, but to look at broader trends in the profession. The kinds of analysis the AHA occasionally does, but on a more micro scale.
So if this has whetted your appetite, you know where to start.