Seriously though. I’ve known about the concept of ‘regular expressions’ for years, but for some reason I never took the plunge. And now that I have, my mind is absolutely blown away. Remember all those months in grad school (c. 1998-2000) when I was OCRing, proofing and manually parsing thousands of letters into my Access database? Well I sure do.
Twenty years later, I now discover that I could’ve shaved literally months off that work, if only I’d adopted the regex way of manipulating text. I’ll blame it on the fact that “digital humanities” wasn’t even a thing back then – check out Google Ngram Viewer if you don’t believe me.
So let’s start at the beginning. Entry-level text editing is easy enough: you undoubtedly learned long ago that in a text program like Microsoft Word you can find all the dates in a document – say 3/15/1702 and 3/7/1703 and 7/3/1704 – using a wildcard search like 170^#, where ^# is the wildcard for any digit (number). That kind of search will return 1701 and 1702 and 1703… But you’ve also undoubtedly been annoyed when you next learn that you can’t actually modify all those dates, because the wildcard character will be replaced in your basic find-replace with a single character. So, for example, you could easily convert all the forward slashes into periods, because you simply replace every slash with a period. But you can’t turn a variety of dates (text strings, mind you, not actual date data types) from MM/DD/YYYY into YYYY.MM.DD, because you need wildcards to find all the digit variations (3/15/1702, 6/7/1703…), but you can’t keep those values found by wildcards when you try to move them into a different order. In the above example, trying to replace 170^# with 1704 will convert every year with 1704, even if it’s 1701 or 1702. So you can cycle through each year and each month, like I did, but that takes a fair amount of time as the number of texts grow. This inability to do smart find-replace is a crying’ shame, and I’ve gnashed many a tooth over this quandary.
Enter regular expressions, aka regex or grep. I won’t bore you with the basics of regex (there’s a website or two on that), but will simply describe it as a way to search for patterns in text, not just specific characters. Not only can you find patterns in text, but with features called back references and look-aheads/look-backs (collectively: “lookarounds”), you can retain those wildcard characters and manipulate the entire text string without losing the characters found by the wildcards. It’s actually pretty easy:
Yep, it’s been a computational summer. Composed mostly of reading up on all things digital humanities. (Battle book? What battle book?) Most concretely, that’s meant setting up a modest Digital History Lab for our department (six computers, book-microfilm-photo scanners, a Microsoft Surface Hub touch display, and various software), and preparing for a brand new Intro to Digital History course, slated to kick off in a few weeks.
I’ve always been computer-curious, but it wasn’t until this summer that I fully committed to my inner nerdiness, and dove into the recent shenanigans of “digital humanities.” Primarily this meant finally committing to GIS, followed by lots of textual analysis tools, and brushing up on my database skills. But I’ve even started learning Python and a bit more AppleScript, if you can believe it.
So, in future posts, I’ll talk a little less about Devonthink and a bit more about other tools that will allow me to explore early modern European military history in a whole new way.