Archive | Methodology

Cleaning Text with Python

So all of us early modern Europeanists owe the Early English Books Online (EEBO) project a debt of gratitude. Tens of thousands of books published in England before the 19th century, all of them scanned, and, in the past few years, downloadable. Thanks to the Text Creation Partnership (TCP), some 60,000 of these 125,000 books have been transcribed into full-text versions, mostly those published before 1700. Next year, 2020, everyone with an internet connection will have access to all 60,000. For now, those without an institutional subscription will have to make do with only 25,000 or so. Life is hard.

No surprise, scholars have been using this resource for years, but only recently have the digital humanities matured to the point where we can deal with this mass of text on a larger scale, using it for more than just individual keyword searches. If you want to download what’s publicly available, you should visit the Visualizing English Print (VEP) project. But as VEP explains, the hand-transcribed texts have their issues. So they’ve created ‘SimpleText’ versions of the TCP documents – no more outdated XML markup for us! And they’ve also created processed versions that clean up some of the most common errors in the corpus.

VEP is a great service. But I want more. So I decided to learn Python and create my own Python code (in a Jupyter notebook) to clean these EEBO TCP texts on my own terms. Some of my corrections replicate what VEP has done, but my code also goes beyond it to make further changes. I’ll spare you the details here, but I go into an obscene amount of detail in the Jupyter notebook, explaining the various errors I’ve encountered and how I went about fixing them. The code isn’t perfect, but it does a pretty good job so far, if only through repetitive brute force. And it’s really helped me learn some basic Python along the way.
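To give a flavor of the approach – the patterns below are purely illustrative examples of the brute-force find-and-replace idea, not the notebook’s actual list of fixes:

```python
import re

# Illustrative substitutions only; the real notebook handles many more
# error types, with more careful patterns.
fixes = [
    (r'vv', 'w'),   # 'vvhich' -> 'which' (a common early-print quirk)
    (r'ſ', 's'),    # long s -> s
]

def clean(text):
    for pattern, replacement in fixes:
        text = re.sub(pattern, replacement, text)
    return text

print(clean('vvhere the paſſage is obſcure'))
# -> 'where the passage is obscure'
```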

Though it won’t make too much sense until you go through the notebook, here’s a summary of the variety of errors the notebook checked for in the TCP’s 1640 edition of the Duke of Rohan’s Compleat Captain (commentaries on Caesar), and how many of each it found and corrected:

Screenshot 2019-05-27 13.45.15.png


If you need a sample of the specific changes made:

Screenshot 2019-05-27 13.47.04.png

And this is only the beginning.

So if you’re Python-curious and wonder what all the fuss is about, you can check out my GitHub repository: https://github.com/ostwaldj/eebo_tcp_clean_text. But be warned – for it to work, you’ll need to know a tiny bit of Python, and have Python 3+ as well as Jupyter notebooks (preferably via Anaconda) already installed. Once you have Python/Jupyter installed, you should be able to just download the repo, unzip it, open the Jupyter notebook, change the file path to match your machine, and it should be ready to go, at least on my sample Rohan text. For those with a little Python knowledge, it should also be easy to alter the code, e.g. to expand it to cover additional types of errors or changes.

Hopefully, in the future, I’ll have time to set it up with MyBinder, so it can be run by anyone in a web browser.

To the future!


Sabbatical in the rear-view mirror

Now that my sabbatical has officially ended, the summer begins. I’ll gradually share with the world all the wonderful digital discoveries from my Year of the Digital. Discoveries that have so engulfed my world that I’ve slighted the blog for several months. But a short teaser list will suffice for now.

What did I do over the past year+ of “me-time”? Why, I…

  1. Learned enough Python to become a danger to myself, and the historical community more generally.
  2. Learned enough QGIS (a geographical information system) to visualize the fruition of my 25-year dream of mapping early modern military operations.
  3. Reacquainted myself with some of the gritty details of MS Access and relational databases. Because somebody’s gotta make a giant dataset of all those early modern wars.

Don’t worry, there are still plenty of digital skills/tools to work on, including learning graph databases and learning enough web tools to host custom databases and maps. And let’s not forget collecting the data to put in those digital tools. Digital history is the wave of the future, after all. Today.

But first on the list is to share my first (major) Python project with the world – code that will clean EEBO TCP text documents, making them easier to analyze with natural language processing (NLP) techniques. Coming soon…

You might be Millner

If words like “Army”, “Camp”, “march”, “Day”, “pitch”, and “Leagues” outnumber many common stopwords…

Screenshot 2019-01-11 13.07.01.png

You might be a campaign journal.

And if the fifth-most common word token is “d”, and if “Duke” and “Prince” are close behind, and if you capitalize your common nouns, you are pretty well assur’d that you are, in fact, an 18th-century Campaign Journal.

Millner’s Compendious Journal (1733), to be precise.

For those modern sticklers for method, lowercasing the text doesn’t invalidate the point:

Screenshot 2019-01-11 13.25.30.png
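If you’re curious how a count like this is done, here’s a minimal sketch – the file name and the crude tokenizing are my assumptions, not the exact code behind these screenshots:

```python
from collections import Counter
import re

# Hypothetical file name; any plain-text transcription will do.
with open('millner_journal.txt', encoding='utf-8') as f:
    text = f.read()

tokens = re.findall(r"[A-Za-z']+", text)       # crude tokenizer
print(Counter(tokens).most_common(20))         # raw counts, capitals intact

lowered = [t.lower() for t in tokens]          # the lowercased version
print(Counter(lowered).most_common(20))
```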

Now, cleaning the dirty OCRed text? That’s another matter…

From historical source to historical data

Where I offer a taste of just one of the low-hanging fruits acquired over my past five months of Python: The Sabbatical.

Digital history is slowly catching on, but, thus far, my impression is that it’s still limited to those with deep pockets – big, multi-year research projects with a web gateway and lots of institutional support, including access to computer-scientist collaborators. Since I’m not in that kind of position, I’ve set my sights a bit lower, focusing on the low-hanging fruit that’s available to historians just starting out with Python.

Yet much of this sweet, juicy, low-hanging fruit is, tantalizingly, still just out of reach. Undoubtedly you already know that one of the big impediments to digital history generally, and to historians playing with the Python programming language specifically, is the lack of historical sources in a structured digital format. We’ve got thousands of image PDFs, even OCRed ones, but it’s hard to extract meaningful information from them in any structured way. And if you want to clean that dirty OCR, or analyze the text in any kind of systematic way, you need it digitized, but in a structured format.

My most recent Python project has been to create some code that automates a task I’m sure many historians could use: parsing one big, long document of textual notes/documents into a bunch of small ones. It took one work day to create, without the assistance of my programming wife, so I know I’m making progress! Eventually I’ll clean the code up and put it on my GitHub account for all to use. But for now I’ll just explain the process and show the preliminary results. (For examples of how others have done this with Python, check out The Programming Historian, particularly this one.)

Parsing the Unparseable: Converting a semi-structured document into files

If you’re like me, you have lots of historical documents – most numerous are the thousands of letters, diary and journal entries from dozens of different authors. Each collection of documents is likely drawn from a specific publication or archival collection, which means they start out isolated in their own little silos. If you’re lucky, they’re already in some type of text format – MS Word or Excel, a text file, what have you. And that’s great if you just want to search for text strings, or maybe even use regular expressions. But if you want more – if, say, you want to compare person A’s letters with person B’s letters over the same timespan, or compare what they said about topic X, or what they said on date Z – then you need to figure out a way to make them more easily comparable, to quickly and easily find those few needles in the haystack.

The time-tested strategy for historians has been to physically split up all your documents into discrete components and then keyword and organize those individual letters (or diary entries, or…). In the old days – which are still quite new for some historians – you’d use notecards. I’ve already documented my own research journey away from Word documents to digital tools (see the Devonthink tag). I even created/modified a few AppleScripts to automate this very problem in Devonthink in a rudimentary way: one, for example, can ‘explode’ (i.e. parse) a document by creating a new document for every paragraph in the starting document. Nice, but it can be better. Python to the rescue.

The problem: lots of text files of notes and transcriptions of letters, but not very granular, and therefore not easily compared, requiring lots of wading through dross, with the likelihood of getting distracted. This is particularly a problem if you’re searching for common terms or phrases that appear in lots of different letters. Wouldn’t it be nice if you could filter your search by date, or some other piece of metadata?

The solution: use Python code to parse the documents (say, individual letters, or entries for a specific day) into separate files, making it easy to home in on the precise subject or period you’re searching for, and enabling precise tagging and keywording.

Step 1:

For proof of concept, I started with a transcription of a campaign journal kindly provided to me by Lawrence Smith, in a Word document. I’m sure you have dozens of similar files. He was faithful in his transcription, even to the extent of mimicking the layout of the information on the page with the use of tabs, spaces and returns.

Newhailes_sample1.png

Great for format fidelity, but not great for easily extracting important information, particularly if you want, for example, June to be right next to 20th, instead of on the line below, separated by a bunch of officers’ names. (‘Maastricht’ and ‘London’ are actually a bit confusing, because I’m pretty sure the place names after the dates are that day’s passwords – at least that’s what I’ve seen in other campaign journals. That some of the entries explicitly list a camp location reinforces my speculation.) Of course people can argue about which information is ‘important,’ which is yet another reason why it’s best if you can do this yourself.

Aside: As you are examining the layout of the document to be parsed, you should also have one eye towards the future. In this case, that means swearing to yourself: “I will never again take unstructured notes that will require lots of regex for parsing.” In other words, if you want to make your own notes usable by the computer and don’t already have a sophisticated database set up for data entry, use a consistent format scheme (across sources) that is easy to parse automatically. For example, judicious use of tabs and unique formatting:

Early_formatting_ideas.png

Step 2:

Clean up the text; specifically, make the structure more standardized so different bits of info can be easily identified and extracted. For this document, that means making sure each first line consists only of the date and camp location (when available), that each entry is separated by two carriage returns, and adding a distinctive delimiter (in this case, two colons, ‘::’) between each folio number and its entries – because you’ll ultimately have the top level of your structured data organized by folio, with multiple entries per folio (a one-to-many relationship, for those of you familiar with relational databases like Access). Cleaning the text can easily be done with regex, allowing you to cycle through and make the appropriate changes in minutes. Assuming you know your regular expressions, that is.

The result looks like this:

Newhailes_sample.png

Note that this stage is not changing the content, i.e. it’s not ‘preprocessing’ the text – doing things like standardizing spelling, or expanding contractions, or what have you. Nor did I bother getting rid of extra spaces, etc. Those can be stripped with Python as needed.

For this specific document, note as well that some of the formatting for the officers of the day is muddled (the use of curly brackets seems odd), which might mean a loss of information. But if that info’s important, you should take care to figure out how to robustly record it at the transcription stage. If you’re relying on the kindness of others, ‘beggars can’t be choosers.’ But, if you’re lucky, you happen to have a scanned reproduction of a partial copy of this journal from another source, which tells you what information might be missing from the transcription:

Newhailes_sample_BL_Add61404.png

Camp journal sample of above, from British Library, Add MS 61404, f. 45.

You probably could do this standardizing within your Python code in Jupyter Notebook, but I find it easier to interact with regex in my text editor (BBEdit). Your mileage may vary.
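If you did want to keep everything in Python, the same kind of standardizing can be done with the re module. A hypothetical sketch – these patterns are guesses at the cleanup described above, not the actual edits made in BBEdit:

```python
import re

with open('newhailes_raw.txt', encoding='utf-8') as f:  # hypothetical file name
    text = f.read()

# Make sure entries are separated by exactly two returns.
text = re.sub(r'\n{3,}', '\n\n', text)

# Add the '::' delimiter after each folio marker, e.g. 'f. 40' -> 'f. 40 ::'.
text = re.sub(r'(?m)^f\.\s*(\d+)', r'f. \1 ::', text)

with open('newhailes_clean.txt', 'w', encoding='utf-8') as f:
    f.write(text)
```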

Step 3:

Once you get the text in a standard format like the above, you read it into Python and convert it into a structured data set. If you don’t know Python at all, the following details won’t make sense. So go read up on some Python! One of the big hurdles for the neophyte programmer, as I’ve discovered over and over, is seeing how the different pieces fit together into a whole, so that’s what I’ll focus on here. In a nutshell, the code does the following, after you’ve cleaned up the structure of the original document in your text editor (a minimal sketch of the whole sequence follows this list):

  1. Read the file into memory as one big, long string.
  2. Perform any other cleaning of the content you want.
  3. Then perform several passes to massage the string into a dictionary with a nested list for the values. There may be a better, more efficient way to do this in fewer lines, but my beginner code does it in four main steps:
    1. Convert the document to a list, splitting it at the ‘f. ’ delimiter. Now you have a list with each folio as a separate item.
      list_items.png
    2. Always look at your results. The first item of the resulting list is empty – not an encoding error, just how split() works: when a string begins with the delimiter, the first piece returned is an empty string – so just delete that item from the list before moving on.
    3. Now, read the resulting list items into a Python dictionary, with the folio number as the dictionary key and all of the entries on that folio as the value. Use the ‘::’ as the delimiter here, with the following line of code, a ‘dictionary comprehension’, as they call it. Notice how the strip and split methods are chained together, performing multiple changes on the item object in that single bit of code:
      dictionary.png
    4. Now use a for loop to parse each value into separate list items, using the other delimiter of ‘\n\n’ (two returns) between entries, casting the value to a string first (since the strip and split methods only work on strings). This gives you a dictionary with the folio as the dict key, and the value is now a nested list, with each of the entries associated with that folio as a separate item, as you can see with folio 40’s four entries:
      dictionary_nested_list.png
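Putting those steps together, a minimal sketch might look like the following. The file name and exact delimiters are assumptions based on the screenshots above, so treat it as an approximation of the code in the images rather than a copy:

```python
# Step 1: read the cleaned file into memory as one long string.
with open('newhailes_clean.txt', encoding='utf-8') as f:
    text = f.read()

# Step 3.1: split at the folio marker; each folio becomes a list item.
folios = text.split('f. ')

# Step 3.2: drop the empty first item (split() returns an empty string
# when the text starts with the delimiter).
folios = folios[1:]

# Step 3.3: a dictionary comprehension, chaining strip and split; the key
# is the folio number, the value everything after the '::' delimiter.
journal = {item.split('::')[0].strip(): item.split('::', 1)[1]
           for item in folios}

# Step 3.4: split each value at blank lines into a nested list of entries.
for folio in journal:
    journal[folio] = [entry.strip()
                      for entry in str(journal[folio]).split('\n\n')
                      if entry.strip()]

print(journal['40'])  # e.g. the four entries on folio 40
```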

That’s pretty much it. Now you have a structure for your text. Congratulations, your text has become data, or data-ish at least. The resulting Python dictionary allows you to search any folio, and it will return a list of all the letters/entries on that folio. You can loop through all those entries and perform some function on/with them. So that’s a good thing to “pickle”, i.e. write to a binary file, so that it can be easily read back as a Python dictionary later on.
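A sketch of the pickling, assuming the journal dictionary from the sketch above and a file name of my own choosing:

```python
import pickle

# Write the dictionary to a binary file...
with open('newhailes_journal.pkl', 'wb') as f:
    pickle.dump(journal, f)

# ...and read it back later as a regular Python dictionary.
with open('newhailes_journal.pkl', 'rb') as f:
    journal = pickle.load(f)
```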

Once you have your data structured, and have maybe added some more metadata to it, you can do all sorts of analysis with Python’s statistical, NLP, and visualization modules.

But if you are still straddling the Devonthink-Python divide, like I am, then you’ll also want to make these parsed bits available in Devonthink. Add a bit of code to write out each dictionary key-value pair to a separate file, and you end up with several hundred files:

Newhailes_finder_folder.png
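That ‘bit of code’ might look something like this – the output folder and file-naming scheme here are my own inventions, not necessarily what produced the folder above:

```python
from pathlib import Path

out_dir = Path('newhailes_entries')  # hypothetical output folder
out_dir.mkdir(exist_ok=True)

for folio, entries in journal.items():
    for i, entry in enumerate(entries, start=1):
        # One file per entry, e.g. 'f40_entry2.txt'.
        (out_dir / f'f{folio}_entry{i}.txt').write_text(entry, encoding='utf-8')
```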

Each file will have only the content for that specific entry, making it easy to precisely target your search and keywording. The last thing you want to do is cycle through several dozen hits in a long document for that one hit you’re actually looking for.

Newhailes_sample_entry.png

That’s it. Entry of May 8th, 1705 in its own file.

The beauty is that you can add more to the code – try extracting the dates and camps, change what information you want to include in the filename, etc. Depending on the structure of the data you’re using, you might need to nest dictionaries or lists several layers deep, as discussed in my AHA example. But that’s the basics. Pretty easy, once you figure it out, that is.
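For instance, a hypothetical first stab at pulling the date off an entry – the pattern is a guess based on the ‘May 8th’ entry shown above:

```python
import re

# Guessing at the entry format: month name first, then day number.
date_pat = re.compile(r'^([A-Z][a-z]+)\.?\s+(\d{1,2})')

m = date_pat.match('May 8th  Camp at Maastricht')
if m:
    month, day = m.groups()
    print(month, day)  # -> May 8
```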

Even better: now you can run the same code, with a few minor tweaks, on all of those other collections of letters and campaign journals that you have, allowing you to combine Newhailes’ entries with Deane’s and Millner’s and Marlborough’s letters and… The world’s your oyster. But, like any oyster, it takes a little work opening that sucker. Not that I like oysters.

Where the historians are, 2017

“Shaving the yak” is a phrase used to describe the process of programming. It alludes to the fact that you often have to take two, or more, steps backward in order to eventually move one step forward. You want a sweater, so first you need to get some yarn, but to do that you have to… and eventually you find yourself shaving a yak. The reason why you even consider shaving a yak is that, once you’ve shaved said yak, you now have lots of yarn, which allows you to make many sweaters. This colorful analogy has a surprising number of online images, and even an O’Reilly book. It’s a thing.

I have been doing a lot of digital yak-shaving over the past four months. Come to think of it, most of my blog posts consist of yak shaving.

So if you’re interested in learning to code with Python but not sure whether it’s worth it, or if you just want to read an overview of how I used Python and QGIS to create a map like this from a big Word document, then continue reading.

history_programs_ba_map.png



Wars of Italy, pt 2

A few more random maps of the Wars of Italy, just because it’s all I’ve got time for.

First off, the locations of various combats (battles and sieges mostly) from 1494 to 1559, color-coded by war, with the Natural Earth topo layer as the base map. It might be more useful to group the wars together into a smaller number of categories (make a calculated field). Or maybe make them small multiples by war. But it’s a start.

Screenshot 2018-03-16 10.42.18.png

Then, using the Data defined override and Size Assistant style in QGIS 2.18, you can add army sizes to the symbols (sizeA+sizeB), to create a multivariate map. Note, however, that I don’t have very many army size statistics (the no-data events are all those tiny dots), but you get the idea – add a continuous variable to a categorical variable, and you’ve got two dimensions.

Screenshot 2018-03-16 10.50.10.png

Remember, with GIS and a good data set, the world’s your oyster.

Next up – getting that good data set. In other words, setting up the Early Modern Wars database in MS Access. What? You want to see my entity-relationship diagram so far? Sure, why not:

EMEWars ER diagram.PNG

And, once sabbatical hits this summer, I’ll be appealing to y’all (just got back from Texas) to help me fill in the details, to share our knowledge of early modern European warfare with the world.

Historical Research in the 21st Century

So let’s say you’ve become obsessed with GIS (geographical information systems). And let’s also posit that you’re at a teaching institution, where you rotate teaching your twelve different courses plus senior seminars (three to four sections per semester) over multiple years, which makes it difficult to remember the ins and outs of all those historical narratives of European history from the 14th century (the Crusades, actually) up through Napoleon – let’s ignore the Western Civ since 1500 courses for now. And let’s further grant that you are particularly interested in early modern European military history, yet can only teach it every other year or so.

So what’s our hypothetical professor at a regional, undergraduate, public university to do? How can this professor possibly try to keep these various periods, places and topics straight, without burdening his (errr, I mean “one’s”) students with one damned fact after another? How to keep the view of the forest in mind, without getting lost among the tree trunks? More selfishly, how can one avoid spending way too much prep time rereading the same narrative accounts every few years?

Why, visualize, of course! I’ve posted various examples before (check out the graphics tag), but now that GIS makes large-scale mapping feasible (trust me, you don’t want to manually place every feature on a map in Adobe Illustrator), things are starting to fall into place. And, in the process, I – oops, I mean our hypothetical professor – ends up wondering what historical research should look like going forward, and what we should be teaching our students.

I’ll break my thoughts into two posts: first, the gritty details of mapping the Italian Wars in GIS (QGIS, to be precise); and then a second post on collecting the data for all this.

So let’s start with the eye-candy first – and focus our attention on a subject just covered in my European Warfare class: the Italian Wars of the early 16th century (aka Wars of Italy). I’ve already posted my souped-up timechart of the Italian Wars, but just to be redundant:

ItalianWars1494-1532PPT

Italian Wars timechart

That’s great and all, but it really requires you to already have the geography in your head. And, I suppose, even to know what all those little icons mean.

Maps, though, actually show the space, and by extension the spatial relationships. If you use PowerPoint or other slides in your classes, hopefully you’re not reduced to re-using a map you’d digitized in AutoCAD twenty years earlier, covering a few centuries in the future:

ItalySPM

Instead, you’ve undoubtedly found pre-made maps of the period/place online – either from textbooks, or from other historians’ works – Google Images is your friend. You could incorporate raster maps that you happen across:

Screenshot 2018-02-17 13.59.49

Maybe you found some decent maps with more political detail:

Screenshot 2018-02-17 13.59.58

Maybe part of your subject matter has been deemed important enough to merit its own custom map, like this digitized version of that old West Point historical atlas:

campaigns_charles_7

If you’re a bit more digitally-focused, you probably noticed a while back that Wikipedia editors have started posting vector-based maps, allowing you to open them in a program like Adobe Illustrator and then modify them yourself, choosing different fills and line styles, maybe even adding a few new features:

Italian Wars 1494 map

Now we’re getting somewhere!

But, ultimately, you realize that you really want to be your own boss. And you have far more questions than your bare-bones map(s) can answer. Don’t get me wrong – you certainly appreciate those historical atlases that illustrate Renaissance Italy in its myriad economic, cultural and political aspects. And you also appreciate the potential of the vector-based (Adobe Illustrator) approach, which allows you to add symbols and styling of your own. You can even search for text labels. Yet they’re just not enough. Because you’re stuck with that map’s projection. Maybe you’re stuck with a map in a foreign language – ok for you, but maybe a bit confusing for your students. And what if you want to remove distracting features from a pre-existing map? What if you care about what happened after Charles VIII occupied Naples in early 1495? What if you want to significantly alter the drawn borders, or add new features? What if you want to add a LOT of new features? There are no geospatial coordinates in the vector maps that would allow you to accurately draw Charles VIII’s 1494-95 march down to Naples, except by scanning in another map with the route, warping the image to match the vector map’s boundaries, and then eye-balling it. Or what if you want to locate where all of the sieges occurred – the dozens of sieges?

You could, as some have done, add some basic features to Google Maps or Google Earth Pro, but you’re still stuck with the basemap provided, and, importantly, with Google’s (or Microsoft’s, or whoever’s) willingness to continue their service in its current, open form. The Graveyard of Digital History – so very young! – is already littered with great online tools that were born and then either died within a few short years, or slowly became obsolete and unusable as internet technology passed them by. Those online tools that do survive for more than five years often do so by transforming into proprietary, fee-based services, or by getting swallowed up by one of the big boys.

And what if you want to conduct actual spatial analysis, looking for geospatial patterns in your data? Enter GIS.

So here’s my first draft of a map visualizing the major military operations in the Italian peninsula during the Italian Wars. Or, more accurately, locating and classifying (some of) the major combat operations from 1494 to 1530:

Screenshot 2018-02-17 13.40.19

Pretty cool, if you ask me. And it’s just the beginning.

How did I do it? Well, the sausage-making process is a lot uglier than the final product. But we must have sausage. Henry V made the connection between war and sausage quite clear: “War without fire is like sausages without mustard.”

So to the technical details, for those who already understand the basics of GIS (QGIS in this case). If you don’t know anything about GIS, there are one or two websites on the subject.

  • I’m using Euratlas’ 1500 boundaries shapefile, but I had to modify some of the owner attributes and alter the boundaries back to 1494, since things can change quickly, even in History. In 1500, the year Euratlas chose to trace the historical boundaries, France was technically ruling Milan and Naples. But, if you know your History, you know that this was a very recent change, and you also know that it didn’t last long, as Spain would come to dominate the peninsula sooner rather than later. So that required some work fixing the boundaries to start at the beginning of the war in 1494. I should probably have shifted the borders from 1500 back to 1494 using a different technique (ideally in a SpatiaLite database where you could relate the sovereign_state table to the 2nd_level_divisions table), but I ended up doing it manually: merging some polygons, splitting other multi-polygons into single polygons, modifying existing polygons, and clipping yet other polygons. Unfortunately, these boundaries changed often enough that I foresee a lot of polygon modifications in my future…
  • Notice my rotation of the Italian boot to a reclining angle – gotta mess with people’s conventional expectations. (Still haven’t played around with Print Composer yet, which would allow me to add a compass rose.) More important than being a cool rebel who blows people’s cartographic preconceptions, I think this non-standard orientation offers a couple of advantages. First, it allows you to zoom in a bit more, to fit the length of the boot along the width rather than height of the page. More subtly, it also reminds the reader that the Po river drains ‘down’ through Venice into the Adriatic. I’m sure I’m not the only one who has to explicitly remind myself that all those northern European rivers aren’t really flowing uphill into the Baltic. (You’re on your own to remember that the Tiber flows down into the Tyrrhenian Sea.) George “Mr. Metaphor” Lakoff would be proud.
  • I converted all the layers to the Albers equal-area conic projection centered on Europe, for valid area calculations. In case you don’t know what I’m talking about, I’ll zoom out, and add graticules and Tissot’s indicatrices, which illustrate the nature of the projection’s distortions of shape, area and distance as you move away from the European center (i.e. the main focus of the projection):
    Screenshot 2018-02-17 14.21.17
    And in case you wanted my opinion, projections are really annoying to work with. But there’s still room for improvement here: if I could get SpatiaLite to work in QGIS (damn shapefiles saved as SpatiaLite layers won’t retain the geometry), I would be able to re-project layers on the fly with a SQL statement, rather than saving them as separate shapefiles. (For the curious, a small Python reprojection sketch follows this list.)
  • I’m still playing around with symbology, so I went with basic shape+color symbols to distinguish battles from sieges (rule-based styling). I did a little bit of customization with the labels – offsetting the labels and adding a shadow for greater contrast. Still plenty of room for improvement here, including figuring out how to make my timechart symbols (created in Illustrator) look good in QGIS.
    After discovering the battle site symbol in the tourist folder of custom markers, it could look like this, if you have it randomly color the major states, and include the 100 French battles that David Potter mentions in his Renaissance France at War, Appendix 1, plus the major combats of the Italian Wars and Valois-Habsburg Wars listed in Wikipedia:
    Screenshot 2018-03-01 14.18.11.png
    Boy, there were a lot of battles in Milan and Venice, though I’d guess Potter’s appendix probably includes smaller combats involving hundreds of men. Haven’t had time to check.
  • I used Euratlas’ topography layers, 200m, 500m, 1000m, 2000m, and 3500m of elevation, rather than use Natural Earth’s 1:10m raster geotiff (an image file with georeferenced coordinates). I wasn’t able to properly merge them onto a single layer (so I could do a proper categorical color ramp), so I grouped the separate layers together. For the mountain elevations I used the colors in a five-step yellow-to-red color ramp suggested by ColorBrewer 2.0.
  • I saved the styles of some of the layers, e.g. the topo layer colors and combat symbols, as qml files, so I can easily apply them elsewhere if I have to make changes or start over.
  • You can also illustrate the alliances for each year, or whenever they change, whichever happens more frequently – assuming you have the time to plot all those crazy Italian machinations. If you make them semi-transparent and turn several years’ alliances on at the same time, their overlap will allow you to see which countries switched sides (I’m looking at you, Florence and Rome) vs. which were consistent:
    Screenshot 2018-03-01 14.27.00.png
  • Plotting the march routes is also a work in progress, starting by importing the camps as geocoded points, and then using the Points2One plugin to connect them up. With this version of Charles’ march down to Naples (did you catch that south-as-down metaphor?), I only had a few camps to mark, so the routes are direct lines, which means they might display as crossing water. More waypoints will fix that, though it’d be better if you could make the march routes follow roads, assuming they did. Which, needless to say, would require a road layer.
    Screenshot 2018-03-01 14.44.52.png
  • Not to mention applying spatial analysis to the results. And animation. And…
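Since projections came up above, here’s the small reprojection sketch promised – it uses pyproj rather than QGIS itself, and the Albers parameters follow the common Europe equal-area conic definition, so check them against your own project’s CRS before trusting any area numbers:

```python
from pyproj import Transformer

# Europe Albers equal-area conic (parameters per the common ESRI:102013
# definition; an assumption - verify against your own project's CRS).
albers = ('+proj=aea +lat_1=43 +lat_2=62 +lat_0=30 +lon_0=10 '
          '+x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs')
transformer = Transformer.from_crs('EPSG:4326', albers, always_xy=True)

x, y = transformer.transform(14.27, 40.85)  # roughly Naples (lon, lat)
print(round(x), round(y))                   # projected meters
```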

More to come, including the exciting, wild world of data collection.