Data entry is fun!

Following up on the comments to a previous post: if anyone is interested in helping to transcribe sources, respond (ostwaldj at easternct dot edu) and I’ll shoot you a PDF chapter from a late 17C manual. If you get hooked, more could follow. I’ll post the final results back up to the blog in case anybody else wants to share in the fruits. Crowdsourcing, they call it.

EEBO and ECCO have many of these manuals as image PDFs, but only a few are available to download in full text through the Text Creation Partnership. And, as speculated on earlier, it seems Google Books is slowly becoming a marketing website, possibly even removing some image PDFs and replacing them with links to online booksellers that will sell you a copy of the work.

First up for our transcription experiment is Nathaniel Boteler’s War Practically Performed (orig. 1663). It doesn’t look like it’s available in text anywhere (not even the image files in Google Books), but I have a scan from an old microfilm.

Boteler War Practically Performed

To see why good ol’-fashioned human transcription is still needed, and for those curious about what OCR software looks like (it’s usually behind the scenes in Google Books), here’s a snapshot from one of the best off-the-shelf packages, ABBYY FineReader:

Sample OCR software screenshot

The panes show a variety of views:

  • on the left, an overview of the document as a whole – the red pages mean lots of errors
  • next to it a full image of the selected page – the green indicates which areas will be ‘read’ as text
  • next to that the resulting OCRed text from FineReader – the blue highlights indicate results the software is unsure about (but there could still be errors elsewhere, if, for example, a valid-yet-incorrect word was read)
  • and at bottom a closeup of the specific line where the cursor is inserted, for detailed comparison.

With this interface the user can then correct all the errors, as well as perform a variety of other manipulations on the image and text. The text can then be exported to a variety of formats.

As you can see, the poor quality of this image (especially all those dots in the background) and the variability of the fonts in the 17C-18C (particularly the use of italics) make it difficult for computers to interpret the letters. Computers are dumb. But that’s where the human brain comes in.

The software interface makes it relatively easy to correct such errors. But as you can also see, there are numerous errors, only some of which are systemic and easily corrected with batch find-and-replace; others are random, as when a stray speck or discoloration is misinterpreted as a letter or word. Not surprisingly then, experts suggest that even 95% accuracy leaves thousands of errors in a single book of 100,000 or more characters (5% of 100,000 is 5,000). And this 95% accuracy is measured by character, not by word, so the accuracy rate is even lower when measured by words, since a single misread character spoils the whole word. As a result, some studies suggest that manual entry may be faster and cheaper for large projects. There are apparently numerous companies in India that will transcribe large projects.
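For the numerically inclined, here’s a back-of-the-envelope sketch of that arithmetic in Python. The five-letter average word length and the idea that errors strike each character independently are my own assumptions for illustration; real OCR errors tend to cluster on bad pages, so treat the word-level figure as a rough guide rather than a measurement.

```python
def expected_char_errors(total_chars: int, char_accuracy: float) -> float:
    """Expected number of misread characters at a given per-character accuracy."""
    return total_chars * (1.0 - char_accuracy)


def word_accuracy(char_accuracy: float, avg_word_len: int) -> float:
    """Chance that every character in a word is read correctly.

    Assumes (hypothetically) that errors are independent from one
    character to the next, so the per-character rates simply multiply.
    """
    return char_accuracy ** avg_word_len


# A book of 100,000 characters read at 95% character accuracy.
errors = expected_char_errors(100_000, 0.95)

# Assumed average word length of 5 letters (an illustrative figure).
word_acc = word_accuracy(0.95, 5)

print(f"Expected character errors: {errors:.0f}")
print(f"Approximate word-level accuracy: {word_acc:.1%}")
```

Under those assumptions, roughly one word in four or five would contain at least one error, which is why a 95% character rate sounds better than it reads.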

Thus the need for our grand crowdsourcing experiment, even with printed texts.

If anyone else has specific requests or other tips or thoughts, let us know in the comments.



