Random thoughts on transcribing and typing sources

Now that there’s software that can actually take advantage of full text, each historian has hundreds of documents that could, nay, should, be entered as full text. So what’s a digital early modern historian to do? (I’ll ignore the more challenging handwritten documents and focus on the ‘easy’ published documents.)

  1. Type all those treatises, campaign narratives and histories yourself. Been there, done that. Not fun, especially when you keep finding more and more of them. It’s labor-intensive work that ideally would be done by someone who charges far less than what you get paid per hour, or far less than what you think your time is worth.
  2. Prioritize. What do you really need fully transcribed, and what can you get by with just reading and taking notes? Unfortunately certain types of documents, and certain methods/questions, may require full text. The same goes if you’re hoping to create some kind of monster textbase that will let you find every instance of a particular term, person or place in an instant.
  3. Optical character recognition. OCR software has been a godsend, especially given all those 19C-20C published collections of documents. The problem comes when trying to OCR published works from the 18C and earlier (even some 19C texts), or when dealing with imperfect copies of even the most pristine text: blurry photos, phantom hands, smudged pages… Still waiting for a solution to that.
    Even if the copy is perfect, the original typeface might not be. From personal experience I can confirm that editing OCRed text, even at 90% accuracy, takes forever. And don’t get me started on old italicized text. OCR no-likey.
    There’s ‘dirty’, i.e. uncorrected, OCR, and then there’s filthy OCR. Like this:

    [Image: The Ugly Side of OCR]

    The gutter is a black hole, the text on either side of it wasn’t even recognized as text (the box in the middle represents an image), and the text that was recognized has many errors. Oddly though, this is actually usable output, as long as you don’t expect every instance of every word to be found (or even most), and you aren’t trying to look at word frequencies and the like. So sometimes you just accept the dirty OCR as a semi-useful supplement to an image PDF.

  4. Or you could download full texts from online databases. Google Books and Archive.org both have OCRed text versions of many of their works, but that doesn’t help too much given the OCR issues with early print works.
  5. Some text versions of early modern works are available to subscribers of databases like EEBO, ECCO, Burney… Their hefty subscription prices help pay for their transcription costs. EEBO offers downloadable full-text versions for a small selection of its works (see the EEBO-Text Creation Partnership), while ECCO doesn’t allow you access to the underlying text at all, other than showing snippets in the results window. On one of my research jaunts I ended up just taking screenshots of each page of results (dozens), so that I could later match them up with my own PDFs at home. Pain in the ass.
    [Image: ECCO search results]

    EEBO will be further extending its utility over ECCO by adding more text versions, as well as releasing more texts to the public, over the next several years. So one option is to simply delay your research till those get released. Assuming the works you need will be included in future releases.

  6. Recently I explored another option: I purchased a copy of voice recognition software (Dragon Naturally Speaking 11.5), went through the brief training process, and then started reading one of my English campaign narratives out loud. The process seemed to work well – by page 30 or so I was reading along in a monotone voice at a pretty decent clip. And the results looked really good on the screen. Too good, as it turns out. I discovered that voice recognition software is, in some ways, more dangerous than OCR software. The errors from OCR software occur on the level of individual letters, which means that you can quickly skim over a page and the misspelled words will jump out at you, and almost every error (aside from the stray speck interpreted as the letter i) will be an incorrect letter, at most two letters substituted for one. For many of these you can use fuzzy search and they won’t make much of a difference – assuming the misspelling is only off by one letter or two (see the second sketch after this list). OCR may give you a lot of errors, but they also tend to be consistent: in a given book, the letter g may always get confused with a q, the ii is usually a u, and uneven inking might result in many e’s being read as c’s. Those can be easily fixed with a well-thought-out global find and replace (see the first sketch after this list). This also means that most OCR errors can be easily ‘fixed’ by reading them in context – ctc. becomes etc., sepamte becomes separate, and so on.
    Voice recognition eliminates the inability of OCR software to interpret irregular and faded fonts, because your human eye is processing the text instead. But it replaces the computer’s difficulty interpreting imperfect visual symbols with its difficulty interpreting imperfect aural symbols. The result: completely different types of errors. It insists on typing real words, so if it doesn’t know the word you’re saying, it types one (or more) that sounds (somewhat) like that word. Which means it’s almost impossible to just glance over a page and have the errors jump out at you – they’re all words alright, just not the right ones. So you have to read every sentence to see if the words actually make sense in the sentence. Grammar checkers aren’t helpful because one noun or verb or adjective is as good as another in most cases – try grammar checking Chomsky’s famous Colorless green ideas sleep furiously. Even worse, if Dragon can’t figure out what four-syllable word you just said, it figures you must’ve said two two-syllable words, or maybe a three-syllable word and a one-syllable word – whichever combination of words is in its dictionary and sounds most similar to what you said. To prevent this, you could watch what’s typed on the screen as you dictate, but that would require you to stop looking at the printed text you’re reading from. That slows down your pace quite a bit, unless you’re good at memorizing paragraphs of text at a glance. (Come to think of it, maybe having the image PDF in an adjacent window on the screen might help? I’ll have to experiment, but it’d probably interfere with what allows us to read quickly – glancing ahead.)
    But that’s not the end of your problems. Any document that discusses places, say a campaign narrative describing all the places around which armies maneuvered or mentioning the polyglot officers and regiments performing such maneuvers, will require you to program in an alias for each proper name, or train the software to type Esquerchin properly, or you’ll be surprised by the results. The problem is multiplied because on one page you’ll be talking about Esquerchin, and on the next page the army will have moved on to Hénin-Beaumont. So you may have dozens of small places not in the standard Dragon dictionary, and not really predictable until you come across them in the text. Nor does Dragon like it when you pronounce those French villages in French, at least when dictating an otherwise-English document. But it can have problems even with English: it’s less likely to choose an archaic English word over a more common modern one (or two, or three…). The complex, long 18C sentences also seem to play havoc with its ability to guess a word based on grammatical context. Sometimes even I need the commas to make sense of the sentence, but saying “comma” four times in every sentence gets old real quick (and you never know which commas are important until you’re at the comma). Speaking out 18C punctuation probably increases the number of words to say by a third! And you can forget about preserving all the orthographic complexities of early modern English, fuffice to fay.
    There are some fixes to these problems. You can, for example, check the transcription every sentence or so. Or you can just be content with the 1% (less? more?) of the words that make absolutely no sense – which is fine unless they are important words, or unless you need to understand the sentences they appear in (maybe I need to practice my phonetic reading). Or, if you hear like a computer, you could pre-read through the text, trying to identify which words will be problematic. The only other option I see is to read through every sentence afterwards just to find that small number of unpredictable but possibly important (and certainly confusing) errors. All these fixes necessarily slow down the data entry speed, and speed is kinda the whole point after all. Which makes me ambivalent about voice recognition for transcribing early modern historical sources, unless of course they don’t mention any people or places, or use any old-fashioned language!
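
A postscript on scripting: since those letter-level OCR confusions tend to be consistent within a given book, the global find and replace mentioned in #6 takes only a few lines of code. Here is a minimal sketch in Python; the filenames and the particular substitution pairs are hypothetical, stand-ins for the list you would compile yourself after skimming a few pages of the actual output:

    import re

    # Hypothetical, book-specific confusions: (regex pattern, replacement).
    # The \b word boundaries keep us from rewriting letters inside
    # otherwise-correct words.
    CORRECTIONS = [
        (r"\bctc\.", "etc."),          # c misread for e
        (r"\bsepamte\b", "separate"),  # 'ra' misread as m
        (r"fuffice", "suffice"),       # long s misread as f
    ]

    def clean_ocr(text):
        """Apply every correction globally to the OCRed text."""
        for pattern, replacement in CORRECTIONS:
            text = re.sub(pattern, replacement, text)
        return text

    # Assumed filenames, purely for illustration.
    with open("dirty_ocr.txt", encoding="utf-8") as f:
        dirty = f.read()
    with open("clean_ocr.txt", "w", encoding="utf-8") as f:
        f.write(clean_ocr(dirty))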
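
And for the fuzzy search that dirty OCR demands, a match that tolerates a wrong letter or two is easy to sketch with Python’s standard library alone. Again a sketch, not a polished tool; the filename, the search term, and the 0.8 similarity cutoff are all assumptions to tune:

    import difflib
    import re

    def fuzzy_find(term, text, cutoff=0.8):
        """Yield (line number, word) for words roughly similar to term."""
        term = term.lower()
        for lineno, line in enumerate(text.splitlines(), start=1):
            for word in re.findall(r"[A-Za-z]+", line):
                ratio = difflib.SequenceMatcher(None, term, word.lower()).ratio()
                if ratio >= cutoff:
                    yield lineno, word

    # Assumed filename; a misreading like 'sepamte' still matches 'separate'.
    with open("dirty_ocr.txt", encoding="utf-8") as f:
        pages = f.read()
    for lineno, word in fuzzy_find("separate", pages):
        print(f"line {lineno}: {word}")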

So what’s left? The only thought left is the old-fashioned way – hire somebody to type it all out. That’s what I’m seriously considering at this stage. Suggestions?


3 responses to “Random thoughts on transcribing and typing sources”

  1. Björn Thegeby says:

    I am a fan of ABBYY Reader’s capability to OCR a graphical jpg/pdf file and combine the OCR data into the resulting pdf file as searchable text. A set of photos of pages from an 18th-century book is combined into a graphical pdf file and OCR’d. The output is beautiful (apart from the image of my fingers holding the pages flat).

    Even if the orthography and bad typesetting play havoc with what would otherwise be an incomplete text-only pdf file, it works very well as a graphical pdf + underlying text. Just like Google Books, come to think of it, but portable and with access assured.

    • jostwald says:

      What that does, of course, is just hide all the OCR errors behind the image – a false sense of security in my experience. (ECCO does that as well.) As long as you’re just using the PDF+searchable text with your eyes, it is indeed a good compromise, though you can’t correct any errors since you can’t see them. You can use it in DEVONthink that way too, although you’ll want the rtf files for other reasons: if you’re worried about file sizes, or want to use other software to analyze the text, then rtf/txt files are necessary.

  2. Ralph Hitchens says:

    I review books for the Journal of Military History & a few other venues. The list price of many of the books they send me is staggering, and in my reviews I often conclude with some sort of plea for a new publishing model. Obviously it would be nice to “Kindleize” everything for no more than $20 or so, but what impact would that have on publishers’ & authors’ revenue, and would this cut seriously into the production run of hard copies needed for academic libraries? This whole business is passing strange….
