Where have you been all my life?

Seriously though. I’ve known about the concept of ‘regular expressions’ for years, but for some reason I never took the plunge. And now that I have, my mind is absolutely blown away. Remember all those months in grad school (c. 1998-2000) when I was OCRing, proofing and manually parsing thousands of letters into my Access database? Well I sure do.

Twenty years later, I now discover that I could’ve shaved literally months off that work, if only I’d adopted the regex way of manipulating text. I’ll blame it on the fact that “digital humanities” wasn’t even a thing back then – check out Google Ngram Viewer if you don’t believe me.

So let’s start at the beginning. Entry-level text editing is easy enough: you undoubtedly learned long ago that in a text program like Microsoft Word you can find all the dates in a document – say 3/15/1702 and 3/7/1703 and 7/3/1704 – using a wildcard search like 170^#, where ^# is the wildcard for any digit. That kind of search will return 1701 and 1702 and 1703… But you’ve also undoubtedly been annoyed to learn that you can’t actually modify all those dates, because in a basic find-replace whatever the wildcard matched gets overwritten by the fixed text you type in the replace box. So, for example, you could easily convert all the forward slashes into periods, because you simply replace every slash with a period. But you can’t turn a variety of dates (text strings, mind you, not actual date data types) from MM/DD/YYYY into YYYY.MM.DD, because you need wildcards to find all the digit variations (3/15/1702, 6/7/1703…), but you can’t keep the values found by those wildcards when you try to move them into a different order. In the above example, trying to replace 170^# with 1704 will turn every year into 1704, even if it was 1701 or 1702. So you can cycle through each year and each month, like I did, but that takes a fair amount of time as the number of texts grows. This inability to do smart find-replace is a cryin’ shame, and I’ve gnashed many a tooth over this quandary.

Enter regular expressions, aka regex or grep. I won’t bore you with the basics of regex (there’s a website or two on that), but will simply describe it as a way to search for patterns in text, not just specific characters. Not only can you find patterns in text, but with features called back references and lookaheads/lookbehinds (collectively: “lookarounds”), you can retain whatever the wildcards matched and manipulate the entire text string without losing those characters. It’s actually pretty easy:

Step 1: Get yourself a real text editor – preferably not MS Word. I use BBEdit, the bigger brother of the free TextWrangler, but PC types tend to use Notepad++. And use plain text files.

Step 2: Learn regular expressions. Many websites offer descriptions of the syntax; there are even regex tester websites where you paste in some text, then write a regex and the site will interactively show you which parts of your text are selected by your regex. And don’t worry: it may take a little while to ramp up, but the thing about computer tools is that you can start with small accomplishments and gradually build up to more complicated, and powerful, solutions. Remember, too, that there are usually multiple regex formulae to find particular patterns: some are more ‘greedy’ (i.e. they grab as much text as they possibly can) than others, some are more efficient, etc. But even if you use kiddie-pool regex, you’ll still be saving yourself a lot of time and effort.

Step 3: Get to work. Open up the desired text file in your text editor, and use its regex tools to make those changes – BBEdit, for example, has the Find… command. So, to follow the example above, your 3/15/1702 can easily be converted into 1702.03.15 in two quick steps:

a) Create a regex to find all those dates. The regex syntax is pretty basic, but can be combined in powerful ways. (Be sure to ignore the punctuation after my following regex, because periods have a special meaning). At its most basic: want to find any single lower case letter? Search for [a-z]. Upper case? [A-Z]. Either? [a-zA-Z]. Upper case followed by lower? [A-Z][a-z]. Want to find any single consonant (in English at least)? Use [b-df-hj-np-tv-xz], or [b-df-hj-np-tv-z] if you want to include that ‘sometimes y’ wannabe. Want any single alphanumeric character? [a-zA-Z0-9]. And those are only the simplest building blocks. I haven’t even mentioned the “repeat this 0 times, 1 time, any number of times…”. Check out online guides for the full syntax.

In our date case, the find regex for the pattern MM/DD/YYYY might be something like:
(\d\d)\/(\d\d)\/(\d\d\d\d)
where \d is any digit (0-9), \/ is an ‘escaped’ forward slash (because a forward slash in regex can mean something else, so you need to distinguish the two types by preceding it by a backslash), and the parentheses around each set of numbers allow us to retain those same values (whatever they may be) in a replace action – the ‘back reference‘ referenced earlier. These back references assign an index number to each parenthesis group: in this case, the first two digits (our month) will be referred to in the replace-regex-to-come as \1, the second two digits are \2, and the final four digits are \3.

Note also that if you have some dates that don’t have the preceding (‘leading’) zeros, e.g. 3/4/1702 vs. 03/04/1702, then you need a slightly different regex to capture those possible patterns. For this, you can use the curly brackets {} with a specification of how often the pattern can repeat. A narrow version of this would be:
(\d{1,2})\/(\d{1,2})\/(\d\d\d\d)
This regex says: find all strings where there’s only one or two digits, followed by a forward slash, followed by another one or two digits, followed by another forward slash, followed by four digits. If you also wanted to find a date like 5/4/04 as well as dates like 4/5/1704 and 4/03/1704, you could change the last parentheses group to (\d{2,4}) – find a year that has 2-4 digits in a row, preceded by those other repeating digits and forward slashes.
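For anyone who would rather experiment in code than in a text editor, here is a minimal sketch of those three patterns using Python's built-in re module. The sample sentence is made up for illustration, and Python is my assumption here, not the tool used above. One difference worth knowing: Python doesn't require the forward slash to be escaped.

```python
import re

# Made-up sample text; not taken from the original letters.
sample = "Wrote on 3/15/1702, again on 03/07/1703, and once more on 5/4/04."

strict = re.compile(r"\d\d/\d\d/\d\d\d\d")        # MM/DD/YYYY with leading zeros only
flexible = re.compile(r"\d{1,2}/\d{1,2}/\d{4}")   # one or two digits for month and day
loose = re.compile(r"\d{1,2}/\d{1,2}/\d{2,4}")    # also accepts two- to four-digit years

strict_hits = strict.findall(sample)
flexible_hits = flexible.findall(sample)
loose_hits = loose.findall(sample)

# Each pattern is broader than the last, so each list is at least as long.
print(len(strict_hits), len(flexible_hits), len(loose_hits))
```

Starting with the strict pattern and loosening it one step at a time makes it easier to see exactly which dates each change lets in.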

[Screenshot: the regex in the regex tester regex101.com, without the parenthetical back references.]

As needed, you can add other qualifications to your regex. Need to only find the MM/DD/YYYY that appear at the beginning of a line? Add ^ to the front. Need to find all dates excluding certain values, maybe you want to avoid 1705? Maybe you use [^5] to specify which characters you want to avoid, as in 170[^5]. Need to only find dates in the middle of a sentence? Try adding [a-z]\s to the beginning – this will only find those dates that are preceded by any lower case letter [a-z] followed by a (white)space. Need a specific range of decades? Try 17[0-2][0-9], for the period 1700-1729. Or maybe you know it’s either 1705 or 1706 you want – use the pipe ‘or’ character inside parentheses, like 170(5|6), or simply list both digits in square brackets, like 170[56]. (Inside square brackets the pipe is just a literal character, so 170[5|6] would also match 170|.)
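A few of these qualifications, sketched in Python's re module on an invented two-line sample (the text and variable names are mine). Two things to watch for: in Python the ^ anchor only matches at the start of every line if you pass re.MULTILINE, and inside square brackets the pipe is a literal character rather than an 'or', so a class like [56] needs no pipe at all.

```python
import re

# Invented sample: one date at the start of a line, three mid-sentence.
sample = ("3/15/1702 began the tour.\n"
          "He left on 4/2/1705 and returned 6/9/1706, not 1/1/1731.")

# ^ plus re.MULTILINE: only dates that start a line.
line_start = re.findall(r"^\d{1,2}/\d{1,2}/\d{4}", sample, flags=re.MULTILINE)

# 17[0-2][0-9]: only years from 1700 through 1729.
early = re.findall(r"\d{1,2}/\d{1,2}/17[0-2][0-9]", sample)

# Either 1705 or 1706, written two equivalent ways.
class_hits = re.findall(r"170[56]", sample)
group_hits = re.findall(r"170(?:5|6)", sample)

# A pipe inside square brackets is matched literally.
pipe_is_literal = re.fullmatch(r"170[5|6]", "170|") is not None
```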

If you want to find dates in a sentence but not ‘capture’ the preceding last letter, the [a-z]\s, you use lookarounds. In this case, you type (?<=[a-z]\s) at the front, and it will select the date, but only from those dates that have a lower case letter and space before them – the regex ‘looks behind’ for the values inside (?<=), but doesn’t include them in its selection. You could do the same with a look ahead, for what characters come after the target string. As you can see, though regex syntax is relatively simple, a regex can become very complex.
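To see the difference between matching the preceding letter and merely looking behind for it, here is a small Python sketch on an invented sentence. Python's re module accepts this lookbehind because [a-z]\s always has the same width (two characters); other engines may have different rules.

```python
import re

# Invented sample: the first date opens the text, the second sits mid-sentence.
sample = "3/15/1702 opens this line, but he wrote on 4/2/1705 as well."

date = r"\d{1,2}/\d{1,2}/\d{4}"

# The letter and space are part of the match, so they would be replaced too.
consuming = re.search(r"[a-z]\s" + date, sample)

# The lookbehind checks for the letter and space but leaves them out of the match.
lookbehind = re.search(r"(?<=[a-z]\s)" + date, sample)

consumed_text = consuming.group()
selected_text = lookbehind.group()
```

Both searches skip the date that opens the text, since nothing precedes it; only the second version selects the date alone.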

Note that all this regex works only because the date format is unique – if you had three sets of numbers separated by forward slashes, but not representing dates, then you’d need a different regex. Try our regex above on text that includes something like 903/34/937 – maybe it’s some weird archive call number. In that case, you might want to include an \s at the beginning of the expression – but will that catch dates at the beginning of a new line? Remember that spaces and line breaks are different characters, and therefore finding a date at the beginning of a new line isn’t necessarily the same as finding one in the middle of a line. In other words, you gotta know your text when playing around with regular expressions.
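Here is one way to test that worry, as a Python sketch on invented text that includes the call number. The guarded pattern, which uses negative lookarounds to refuse any match with a digit immediately before or after it, is my suggestion rather than anything from the steps above. Note that in Python's re module \s does match a line break, but it still can't match at the very start of the text, where there is no character at all.

```python
import re

# Invented sample: a date opening the text, a date opening a line, and a call number.
sample = ("3/15/1702 starts the file.\n"
          "4/5/1704 starts a line. Box 903/34/937 is a call number.")

# The loose pattern happily finds a 'date' inside the call number.
naive = re.findall(r"\d{1,2}/\d{1,2}/\d{2,4}", sample)

# Requiring whitespace first drops the call number, but also the first date.
space_first = re.findall(r"\s\d{1,2}/\d{1,2}/\d{2,4}", sample)

# Lookarounds: no digit directly before or after the candidate date.
guarded = re.findall(r"(?<!\d)\d{1,2}/\d{1,2}/\d{2,4}(?!\d)", sample)
```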

b) Having found the pattern you seek, you can now make another, often simpler, regex to replace it. The replace regex for our date example would then be:
\3.\1.\2
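where you swap the order of the three parentheses groups, and add a period between them – you don’t need to escape the period here because, in BBEdit at least, the replace field treats it as a normal character rather than as regex. Keep in mind that back references copy the digits exactly as they were found, so 3/15/1702 comes out as 1702.3.15; if you want 1702.03.15 you’ll need a second pass to add the leading zeros. Note, too, that these are still just text strings, not date data that could be manipulated through addition, subtraction, etc.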
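The same find-and-replace can be sketched in Python, where \1, \2 and \3 work in the replacement string much as they do in a text editor. The sample sentence and the pad_date helper are my own illustrations. Python also lets you hand re.sub a function instead of a replacement string, which is one way to add leading zeros in the same pass.

```python
import re

DATE = re.compile(r"(\d{1,2})/(\d{1,2})/(\d{4})")

# Invented sample text.
sample = "Letters of 3/15/1702 and 03/07/1703."

# Plain back references: the groups are reordered, digits copied as found.
reordered = DATE.sub(r"\3.\1.\2", sample)

def pad_date(match):
    """Hypothetical helper: rebuild the date as YYYY.MM.DD with leading zeros."""
    month, day, year = match.groups()
    return f"{year}.{int(month):02d}.{int(day):02d}"

# A function as the replacement is called once per match.
padded = DATE.sub(pad_date, sample)
```

The results are still text strings, but zero-padded YYYY.MM.DD strings at least sort in chronological order.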

So, my regex takeaways, which are really only common sense, are to always:

  • Work on a backup copy of your text
  • Use a regex tester (many available online) when you’re first starting out, or aren’t sure about what it might do
  • Include multiple false positive scenarios in your sample test text – this, of course, is a lot easier with experience
  • Remember that the text you’re searching for is surrounded by other text that might also match your pattern, so take those contextual characters into account in your regex, whether to limit your hits via anchors or as a potential lookaround
  • Cycle through the first several edits, just to check for any unforeseen consequences (and there will be unforeseen consequences)
  • Check to see how many edits were made with each regex – if it’s a surprisingly large or small number, you’d better recheck your regex
  • Save (maybe a different copy) after each major edit
  • Start with the most narrow edit possible – you can easily broaden your regex to additional characters when you find cases that slipped through your net, but it’s hard to undo the nuclear option. Starting narrow (e.g. be very careful with the .+ “I will take everything from you” syntax) helps defang the quote you often see associated with regex: “Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems.”
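Two of these habits, previewing the first few hits and checking how many edits were made, are easy to sketch in Python. The preview helper and the sample text are hypothetical; re.subn is a real function that returns both the new text and the number of replacements.

```python
import re
from itertools import islice

def preview(pattern, text, limit=3, context=15):
    """Hypothetical helper: the first few matches with some surrounding text."""
    hits = []
    for match in islice(re.finditer(pattern, text), limit):
        start = max(match.start() - context, 0)
        hits.append(text[start:match.end() + context])
    return hits

# Invented sample text.
sample = "one 3/15/1702 two\nthree 3/7/1703 four\nfive 7/3/1704 six\n"
pattern = r"(\d{1,2})/(\d{1,2})/(\d{4})"

# Look before you leap: inspect a couple of hits in context.
first_hits = preview(pattern, sample, limit=2)

# subn leaves the original string untouched and reports the edit count.
new_text, count = re.subn(pattern, r"\3.\1.\2", sample)
```

If count comes back surprisingly large or small, the original text is still sitting in sample, so nothing is lost by rechecking the regex.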

There are plenty of tutorials on regex online, but most I’ve seen are for computer programmers, for people using more structured data, and for people already at the power user level or higher. So I thought I’d give some examples of how to use regex to clean up historical freeform texts, whether they be OCRed or downloaded from a library database, for mere mortals. The Programming Historian has two good tutorials on regex, with more structured data in mind, here and here.

Regex Applied

My date example above might pique the interest of historians, but what follows is a description of how I used regex to clean up a large text file for more detailed research – note that there are likely more efficient ways to do this, but since I want to check each step, I’ll be more cautious with my regex find-replace.

The document is David Jones’ A Compleat History of Europe, Or, A View of the Affairs Thereof, Civil and Military: From the Beginning of the Treaty of Nimeguen, 1676, to the Conclusion of the Peace with the Turks, 1699 … The Whole Intermix’d with Divers Original Letters, Declarations, Papers, and Memoirs, Never Before Published. T. Mead, 1699.
It’s over 700 pages of text, and fortunately EEBO transcribed it all, and made it available via the Text Creation Partnership. So now I’ve got a text file of 371,000 words, or 2.25 million characters. And while the transcription was superb (manual double-keyed, I’m guessing), there is still a lot of junk that I want to clear out.

So here’s what I do:

  1. Save a separate version of the text file to work on.
  2. Strip out all the junk at the beginning of the file, particularly all those tags <…>. Easy enough just by selecting and deleting up to the title page content.
  3. I can already tell there will be lots of tags <…>, so I check for any other tags. Turns out that every page has some metadata that needs to be eliminated, such as the transition from one page to another:

    “….the single Mediation of /Charles/ the II, King of /Great Britain;/ (that
    of the Pope’s, after much

    View document image [10] containing page [2] Document Images
    </search/full_rec?action=ByID&source=pgimages.cfg&ID=12539519&VID=62930&PAGENO=10&SUBSCRIBER_TCP=Y&FILE=../session/1268405015_5217&SEARCHCONFIG=var_form.cfg&HIGHLIGHT_KEYWORD=default>

    Delay and many Debates, being at last rejected by all the Parties….”

    So that means we need to get rid of all of that stuff separating the first part of the sentence from the second part, on every page. I’d rather not do that hundreds of times (700+ pages, remember), so regex to the rescue! I examine the patterns, and realize that I can select the “View document… ” line with the regex:
    \nView document.+\n
    This finds all lines in between two new lines that start with View document and any other number of characters (the very greedy dot and plus combination), up until and including the next new line. Note that if you replace this with nothing, it also deletes the page numbers, but this isn’t a problem in this use case, since this version of the file will be for text mining and not citation. I have a separate PDF for the original pages images, and the original text file for text searching as needed.
    So I choose Replace & Find a few times, to make sure the regex is doing what I want and no more, and then I Replace All. 510 of those lines are deleted just like that! Note that it’s deleting the two line breaks on either side. If I wanted to keep those, I’d add them back in the Replace field: \n\n, or I could use lookarounds. And if I was really curious, I’d keep in mind the fact that about 510 replacements were made, since that doesn’t seem to equal the number of pages (or half that).

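  4. But now I’ve still got those other inter-page <tags> blathering on about the search and whatnot. So I can just select all that text from the open < tag to the close > tag with an ultra-simple yet ultra-powerful regex: <.+>. This finds every instance of <any amount of text up to the>. Be aware that .+ is greedy: if a line held two tags, it would grab everything from the first < to the last >. That doesn’t bite here because these tags sit on lines of their own, as in the example above; otherwise <[^>]+> is the safer pattern. Replace & Find a few times for quality control, then Replace All and – poof! – over 1,200 of those suckers gone.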
  5. But now those sentences are still unfairly segregated from their brethren; they’re incomplete. If I’m only interested in the words of the document (word frequencies and the like), this step isn’t important, but since I care about sentences and word sequences (n-grams, keywords-in-context, and so on), I’d like to get rid of those separations, just in case any text analysis programs don’t look beyond line breaks. So I could do a search for multiple new lines and decrease their number – maybe replace \n\n with nothing, or \n\n with \n. This might, however, delete original paragraph marks, so think carefully about this. At the least, consider putting a padded space in its place. Or possibly add a [^A-Z] at the end of your search – “unless there’s an upper case letter that follows” – keeping in mind that [^A-Z] matches (and so replaces) that following character unless you capture it and put it back, or use the negative lookahead (?![A-Z]) instead. This won’t be foolproof either way, since many words were capitalized even if they didn’t begin a sentence. You’ll probably have to go through several different types of find-replace regexes to clean up those page splits.
  6. A bit more sophisticated, I could focus my attention on the end and beginning of the sentences that are split. How do I know if it’s a split sentence? Well, the line will probably end in a comma or lower case letter (maybe with some padded spaces), there will be one or more new lines, and the next line will begin with a lower case letter, or maybe even more leading spaces or even a tab. You could try a regex like: ([a-z])\n\n([a-zA-Z]), and replace the results with \1 \2. You probably want to do several iterations before you commit to Replace All with this one. And there are numerous other variations you might test out.
  7. You also notice that the text is full of slashes, which look like tags used to indicate formatting words in italics and so on. I don’t want this in my version, so I’ll just strip them out, replacing / with nothing. Easy enough, as the computer does the work of replacing 27,000 of those. But, if you notice there’s a particular pattern of italicization, then maybe it’s worthwhile to consider keeping the forward slashes in. For example, it looks like the author is using italics to indicate proper nouns like place names, and some adjectives (French, Dutch). So maybe we could keep those if we might want to do some XML tagging later on.
  8. You always want to skim through the document, on the lookout for other oddities. One thing you might notice is that this transcription helpfully includes the text in the margins, but is rather careless about where it places that marginal text. For example:
    “An Act for Raising Money by a Poll, Note in marg: Acts of Parliament signed.

    payable Quarterly, for one Year.”

    Here we have useful information, but it’s splitting up our precious sentences. Regex ho! So you could use an expression like:
    (^[A-Z].+)(Note in marg.+)\n
    and replace it with \2\n\1. Not perfect, but at least it gets the sentence fragments closer to each other.

  9. The eagle eye will also notice one other annoying thing: the insertion of ‘Single illegible letter’ in numerous words, like this: “carry hSingle illegible letterm the News.” In most cases, the missing letter is obvious – “Pat, I’d like to solve the puzzle: ‘ham.’ Oh, wait…”. So I Replace & Find to iterate through each one, manually typing the letter. If the missing letter is not clear, I can just type a ?.
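For those who would rather script the cleanup than click through Replace All, steps 3, 4, 6, 7 and 8 might be sketched in Python roughly as follows. This is only a sketch under several assumptions of mine: the function names are hypothetical, the sample page break is a shortened version of the one quoted in step 3, the tag pattern is the negated-class variant <[^>]+> rather than <.+>, and the rejoining pattern uses \n+ (and allows a trailing comma) because deleting the tag leaves three line breaks behind in this sample, not two.

```python
import re

def clean_page_break(text):
    """Apply steps 3, 4, 6 and 7 in order, counting the edits made at each step."""
    counts = {}
    # Step 3: delete each "View document ..." line with the line break on either side.
    text, counts["view_lines"] = re.subn(r"\nView document.+\n", "", text)
    # Step 4: delete angle-bracket tags (negated-class variant, an assumption here).
    text, counts["tags"] = re.subn(r"<[^>]+>", "", text)
    # Step 6: rejoin a split sentence with a single space.
    text, counts["joins"] = re.subn(r"([a-z,])\n+([a-zA-Z])", r"\1 \2", text)
    # Step 7: strip the slashes that mark italics.
    text, counts["slashes"] = re.subn(r"/", "", text)
    return text, counts

def move_marginal_note(text):
    """Step 8: move the marginal note onto its own line, ahead of the sentence."""
    return re.sub(r"(^[A-Z].+)(Note in marg.+)\n", r"\2\n\1", text,
                  flags=re.MULTILINE)

# Shortened sample page break (the URL is abbreviated for illustration).
page = ("of the Pope's, after much\n"
        "\n"
        "View document image [10] containing page [2] Document Images\n"
        "</search/full_rec?action=ByID&PAGENO=10>\n"
        "\n"
        "Delay and many /Debates,/ being at last rejected\n")
cleaned, counts = clean_page_break(page)

margin = ("An Act for Raising Money by a Poll, Note in marg: Acts of Parliament signed.\n"
          "\n"
          "payable Quarterly, for one Year.\n")
moved = move_marginal_note(margin)
```

Keeping a count for each step mirrors the habit of checking how many edits each Replace All made, and running this on a copy of the file keeps the original safe.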

So you get the idea. Lots of things to clean up, but each item can be fixed relatively quickly. Assuming you understand how regex works. So get to work!

Responses to “Where have you been all my life?”

  1. Gavin Robinson says :

    I’m told that the safest and most efficient way to remove tags that start and end with angle brackets is:

    <[^>]+>

    If you use the dot, you have to make it lazy by putting ? after the + otherwise you get everything from the first < to the last >, but that’s said to be less efficient. It’s explained in this tutorial.

      • jostwald says :

        Thanks Gavin. It’s funny how I went from a very conservative step 3 (delete tag only if it has “View document blah blah blah”) to foolhardiness in step 4 (delete ’em all, I don’t care).

        I should’ve mentioned that there are no other XML/HTML tags in the document (beyond the intro tags that I deleted in step 2), and that we don’t want to keep any of the content between any angle tags — the forward slashes serve as formatting markup. So with this type of document, there won’t be that problem of a greedy capture of nested tags. If there had been other XML tags, a few Finds (before replacing) should’ve highlighted the issue, literally.

        But generally your warning is an important one, illustrating the risk of regex adding another problem to your current one.

        Various text editors, including BBEdit, also have menu commands to Remove Markup or strip HTML tags, if people want to play it safe.
