Automatically parse your inventories

Historians owe a debt of gratitude to those turn-of-the-century archivists whose nationalistic yearnings led to the creation of dozens of volumes of archive inventories and catalogs. If you do much work on French military history, you likely know the Inventaire sommaire des archives historiques: archives de guerre. This multi-volume inventory provides short summaries of each volume in the A1 correspondence series, more than 3500 volumes up to the year 1722. That’s a lot to keep track of, as I estimated a while back. So much, in fact, that you’ll likely be going back to that particular well again and again. If so, it might be worth your while to include those details in your note-taking system. Here’s how I did it in DEVONthink Pro Office (DTPO).

The first step in any digitization process is to scan the printed page and convert the page images into text. In ye olden days you had to scan it yourself and then run it through OCR software. Nowadays it’s more likely that you can download an .epub version from Google Books and then convert it to .txt with the free Calibre software. Worst case, download the PDF version and OCR it yourself.

[Image: AG A1 inventaire pdf]

Now you find yourself with a text document full of historical minutiae. Import the text (and the PDF, just to be safe) into DTPO. Next, add delimiters to mark where the one big file should be separated into many little files. But do it the smart way, with automation. Open the text document in Word (right-click in DTPO or find it in the Finder), and then start your mass find-replace iterations, assuming there’s a pattern that you can use to add a delimiter between each volume. Maybe each volume is separated by two paragraph marks in a row, in which case you would find ^p^p and replace it with ^p^p#####. You’ll end up with something like this:

[Image: AG A1 inventaire txt delimited]

As you can see, the results are a bit on the dirty side – I’ll see if I can get a student worker to clean up the volume numbers since they’re kinda important, but the main text is good enough to yield search results.
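If you would rather script the delimiter pass than run it in Word, here is a minimal Python sketch of the same find-replace. It assumes, as in the example above, that volumes are separated by a blank line; the function name and the sample text are made up for illustration, and any run of two or more newlines is collapsed to exactly two.

```python
import re

DELIMITER = "#####"  # the marker used in the post; any string absent from the text works


def add_delimiters(text, delimiter=DELIMITER):
    """Insert the delimiter after every blank-line break (Word's ^p^p)."""
    # Normalise Windows and old Mac line endings so "\n\n" means two paragraph marks.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Each run of two or more newlines becomes two newlines followed by the delimiter.
    return re.sub(r"\n{2,}", "\n\n" + delimiter, text)


# Placeholder text standing in for the OCRed inventory.
sample = "1. (summary of volume 1)\n\n2. (summary of volume 2)\n\n3. (summary of volume 3)"
delimited = add_delimiters(sample)
print(delimited)
```

Like the Word version, this puts the delimiter at the start of each following volume, so the first volume gets none and does not need one.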

Once you’ve saved the doc in Word and returned to DTPO, you can use the DT forum’s Explode by Delimiter script. Check the resulting hundreds of records: if there are more than a few errors, delete the newly created documents, fix the problematic delimiters in the original, and reparse. You’ll want to search not only for false positives, i.e. delimiters added where they shouldn’t have been, but also for false negatives, i.e. volumes that should have delimiters but were missed. For example, search for ^p^# in Word to find any paragraphs that start with a digit (assuming the inventory starts each volume’s paragraph with the volume number).
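The same two checks can be sketched in Python before you ever import the pieces. This is only an illustration, not the forum script: the function names and sample text are hypothetical, and it relies on the assumption stated above that every volume summary begins with its volume number.

```python
import re

DELIMITER = "#####"


def explode(text, delimiter=DELIMITER):
    """Split on the delimiter, trimming whitespace and dropping empty pieces."""
    return [chunk.strip() for chunk in text.split(delimiter) if chunk.strip()]


def missed_volumes(records):
    """Possible false negatives: later lines in a record that start with a digit."""
    suspects = []
    for index, record in enumerate(records):
        # The first line carries the record's own volume number, so skip it.
        for line in record.splitlines()[1:]:
            if re.match(r"\s*\d", line):
                suspects.append((index, line))
    return suspects


def stray_delimiters(records):
    """Possible false positives: records that do not start with a digit."""
    return [record for record in records if not re.match(r"\d", record)]


# Placeholder text with one missed delimiter and one stray delimiter.
text = (
    "1. (volume 1 summary)\n\n"
    "2. (volume 2 summary, delimiter missed)\n\n"
    "#####3. (volume 3 summary, first half)\n\n"
    "#####(second half, split off by a stray delimiter)"
)
records = explode(text)
print(missed_volumes(records))
print(stray_delimiters(records))
```

Both checks are heuristics: a summary line that happens to start with a year will be flagged as a missed volume, so treat the output as a list of places to look rather than a list of errors.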

But wait, there’s more. Once you’ve parsed those, you can even take it a step further. The summaries aren’t usually at the document level, but there is enough detail that it’s worth parsing the descriptions within each volume. After converting all the parsed txt files to rtf files, move each volume’s document to the appropriate provenance tag/group, and then run another parse on that volume’s record, with ; as the delimiter. In the case above, you might also want to parse by the French opening quotation mark, or find-replace « with ;«. Parsing this volume summary gives you a separate record for each topic within the volume, or at least most of them. With all these new parsed records still selected, convert them to rtf and add the provenance info to the Spotlight Comments. Now you’re ready to assign each parsed topic document to whichever topical groups you want.

[Image: AG A1 inventaire DTPO]
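For anyone who wants to preview the second-level parse outside DTPO, here is a rough Python sketch of splitting one volume summary into topics. The function names, the sample summary, and the provenance label format are all hypothetical; the "comment" field merely stands in for the Spotlight Comments.

```python
def split_topics(summary):
    """Split one volume summary into topics on ; and on an opening guillemet."""
    # Same trick as the find-replace of « with ;« so a quotation starts a new topic.
    marked = summary.replace("«", ";«")
    # Split on semicolons, trim whitespace, and drop empty pieces.
    return [part.strip() for part in marked.split(";") if part.strip()]


def tag_with_provenance(topics, provenance):
    """Pair every topic with the same provenance note."""
    return [{"text": topic, "comment": provenance} for topic in topics]


# Placeholder summary; real entries are much longer and messier.
summary = "12. (topic a); (topic b) « (quoted title) »; (topic c)"
topics = split_topics(summary)
tagged = tag_with_provenance(topics, "AG A1 12")  # hypothetical label format
for item in tagged:
    print(item["comment"], "|", item["text"])
```

Splitting on every semicolon will occasionally cut a topic in two, which matches the "or at least most of them" caveat above.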

It’s not perfect, but it’s pretty darn good considering how little effort it requires; maybe an hour or so gets you 500+ volume inventories in separate records. Now you’ve got all those proper nouns in short little documents, ready to search, (auto-)group and sort.

6 responses to “Automatically parse your inventories”

  1. jostwald says :

    Two tweaks:
    A. You don’t need to convert each volume txt file to rtf, since the Explode by Delimiter script converts the results to txt again. The Explode by paragraph script, however, keeps it as rtf.
    B. Made a macro to automate the last phase. Once you have the .txt file for a single volume in the appropriate provenance tag/group (i.e. before you parse by semicolon), you can have a macro:
    1. Parse the selected volume document by the delimiter ; using Execute AppleScript: paste in the Explode by Delimiter script, hardcoding the semicolon rather than using a dialog box.
    2. Select Menu Item: Data-Convert-to Rich Text
    3. Select Menu Item: View-Sort-by Kind
    4. Type Down Arrow (goes to first txt file)
    5. Type Shift-Option-down arrow (selects all txt files)
    6. Type Forward-Delete key to delete txt files
    7. Type Shift-Option-up arrow (selects all rtf files)
    8. Type Shift-Command-I to open Info window
    9. Tab to Spotlight Comments field, where you can enter the generic provenance info for the entire selection.

    Insert pauses to taste.

  2. Joe says :

    Hi,
    I can’t seem to find the Explode by Delimiter script you refer to on the DEVONthink forum. Do you have a link? Thanks.

    • jostwald says :

      Here’s the code:
      tell application "DEVONthink Pro"
          set theSelection to the selection
          set theGroup to current group
          if theSelection is {} then error "Please select some contents."
          -- Ask for the delimiter; an empty answer means split at each paragraph.
          display dialog "Enter the desired text delimiter (or nothing to break at each paragraph):" default answer "" buttons {"OK"} default button 1
          set SplitPointRegEx to text returned of the result
          -- Uncomment the next line to hardcode the delimiter instead of using the dialog.
          ---set SplitPointRegEx to "AAAAA"
          if SplitPointRegEx is equal to "" then set SplitPointRegEx to ASCII character 10
          set OldDelimiters to AppleScript's text item delimiters
          repeat with CurrentItem in theSelection
              set AppleScript's text item delimiters to SplitPointRegEx
              set theSource to the plain text of CurrentItem
              set RepeatCount to 0 as integer
              set TotalCount to (count each text item of theSource) as integer
              repeat until RepeatCount is equal to TotalCount
                  set RepeatCount to RepeatCount + 1
                  set CurrentText to (text item RepeatCount of theSource)
                  -- Skip empty pieces; name each new record after its first 100 characters.
                  if length of CurrentText is greater than 0 then
                      set txtLength to (length of CurrentText)
                      if txtLength < 100 then
                          set CurrentName to (texts 1 through txtLength of CurrentText as string)
                      else
                          set CurrentName to (texts 1 through 100 of CurrentText & "..." as string)
                      end if
                      create record with {name:CurrentName, type:txt, plain text:CurrentText} in theGroup
                  end if
              end repeat
          end repeat

          -- Restore the delimiters that were in effect before the script ran.
          set AppleScript's text item delimiters to OldDelimiters
      end tell

    • jostwald says :

      You can also find versions of the code here.

  3. jostwald says :

    Additional tip: if you parse the document-level descriptions within each volume, consider numbering the parsed documents to maintain their order (there’s also code that does that), in case the document descriptions actually tell you in which part of the volume they are found (even if no document numbers are given).
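The numbering idea in the tip above can be sketched in a few lines of Python. This is not the code the comment refers to, just an illustration with a hypothetical function name: it prefixes each parsed record name with a zero-padded counter so that sorting by name keeps the original order.

```python
def number_records(names):
    """Prefix each name with a zero-padded position so name order equals source order."""
    # Pad to the width of the largest number, e.g. 01..12 for twelve records.
    width = len(str(len(names)))
    return [str(position).zfill(width) + " " + name
            for position, name in enumerate(names, start=1)]


# Placeholder names standing in for the parsed topic records of one volume.
names = ["(topic " + letter + ")" for letter in "abcdefghijkl"]
numbered = number_records(names)
print(numbered[0])
print(numbered[-1])
```

Zero-padding matters because a plain 1, 2, ... 10 prefix sorts 10 before 2 in an alphabetical listing.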
