Archive | May 2019

Cleaning Text with Python

So all of us early modern Europeanists owe the Early English Books Online project a debt of gratitude. Tens of thousands of books published in England before the 19C, all of them scanned and, in the past few years, downloadable. Thanks to the Text Creation Partnership, some 60,000 of these 125,000 books have been transcribed into full-text versions, mostly those published before 1700. Next year, 2020, everyone with an internet connection will have access to all 60,000. For now, those without an institutional subscription will have to make do with only 25,000 or so. Life is hard.

No surprise, scholars have been using this resource for years, but only recently have the digital humanities matured to where we can deal with this mass of text on a larger scale, using it for more than just individual keyword searches. If you want to download what’s publicly available, you should visit the Visualizing English Print project. But as VEP explains, the hand-transcribed texts have their issues. So they’ve created ‘SimpleText’ versions of the TCP documents – no more outdated XML markup for us! And they’ve also created processed versions that have cleaned some of the most common errors in the corpus.

VEP is a great service. But I want more. So I decided to learn Python and create my own Python code (in a Jupyter notebook) to clean these EEBO TCP texts on my own terms. Some of my corrections replicate what VEP has done, but my code also goes beyond to make further changes. I’ll spare you the details here, but I go into an obscene amount of detail in the Jupyter notebook, explaining the various errors I’ve encountered, and how I went about fixing them. The code isn’t perfect, but it does a pretty good job so far, if only through repetitive brute force. And it’s really helped me learn some basic Python along the way.

Though it won’t make too much sense until you go through the notebook, here’s a summary of the variety of errors the notebook checked for in the TCP’s 1640 edition of the Duke of Rohan’s Compleat Captain (commentaries on Caesar), and how many of each it found and corrected:

[Image: Screenshot 2019-05-27 13.45.15.png]

If you need a sample of the specific changes made:

[Image: Screenshot 2019-05-27 13.47.04.png]
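For the Python-curious who want a feel for the general idea before opening the notebook, here is a minimal sketch of rule-by-rule cleaning with a running tally of fixes per error type. To be clear, this is not the notebook's actual code: the rule names, the patterns, the `clean_text` function and the sample sentence are all made up for illustration, and the real error types are the ones documented in the notebook itself.

```python
import re
from collections import Counter

# Hypothetical rules, for illustration only (not the notebook's real list).
# Each rule is (label, compiled pattern, replacement).
RULES = [
    ("long s", re.compile(r"ſ"), "s"),
    ("vv for w", re.compile(r"\bvv"), "w"),
    ("hyphenated line break", re.compile(r"(\w)-\n(\w)"), r"\1\2"),
    ("repeated spaces", re.compile(r" {2,}"), " "),
]


def clean_text(text, rules=RULES):
    """Apply each rule in order and count how many fixes each one made."""
    counts = Counter()
    for label, pattern, replacement in rules:
        # subn returns the new string plus the number of substitutions made
        text, n = pattern.subn(replacement, text)
        counts[label] = n
    return text, counts


# Invented sample text, not taken from the Rohan edition.
sample = "The vvarres of Cæſar  were com-\nmanded vvith ſkill."
cleaned, counts = clean_text(sample)

print(cleaned)
for label, n in counts.items():
    print(f"{label}: {n}")
```

The rules run in list order, so a later rule sees the output of the earlier ones. In a brute-force approach like this, covering a new type of error is mostly a matter of appending another entry to the list.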

And this is only the beginning.

So if you’re Python-curious and wonder what all the fuss is about, you can check out my GitHub repository: https://github.com/ostwaldj/eebo_tcp_clean_text. But be warned – for it to work, you’ll need to know a tiny bit of Python, and have Python 3+ as well as Jupyter notebooks (preferably via Anaconda) already installed. Once you have Python/Jupyter installed, you should be able to just download the repo, unzip it, open the Jupyter notebook, change the file path to match your machine, and it should be ready to go, at least on my sample Rohan text. For those with a little Python knowledge, it should be easy to alter the code with a bit of hacking, e.g. to expand it to cover additional types of errors or changes.

Hopefully, in the future, I’ll have time to set it up with MyBinder, so it can be run by anyone in a web browser.

To the future!

Sabbatical in the rear-view mirror

Now that my sabbatical has officially ended, the summer begins. I’ll gradually share with the world all the wonderful digital discoveries from my Year of the Digital. Discoveries that have so engulfed my world that I’ve slighted the blog for several months. But a short teaser list will suffice for now.

What did I do over the past year+ of “me-time”? Why, I…

  1. Learned enough Python to become a danger to myself, and the historical community more generally.
  2. Learned enough QGIS (Geographical Information Systems) to visualize the fruition of my 25-year dream to map early modern military operations.
  3. Reacquainted myself with some of the gritty details of MS Access and relational databases. Because somebody’s gotta make a giant dataset of all those early modern wars.

Don’t worry, there are still plenty of digital skills/tools to work on, including learning graph databases and learning enough web tools to host custom databases and maps. And let’s not forget collecting the data to put in those digital tools. Digital history is the wave of the future, after all. Today.

But first on the list is to share my first (major) Python project with the world – code that will clean EEBO TCP text documents, making them easier to analyze with natural language processing (NLP) techniques. Coming soon…