Automatically parse your inventories
Historians owe a debt of gratitude to those turn-of-the-century archivists, whose nationalistic yearnings led to the creation of dozens of volumes of archive inventories and catalogs. If you do much work on French military history, you likely know the Inventaire sommaire des archives historiques: archives de guerre. This multi-volume inventory provides short summaries of each volume in the A1 correspondence series, more than 3500 volumes up to the year 1722. That’s a lot to keep track of, as I estimated awhile back. So much in fact, that you’ll likely be going back to that particular well again and again. If so, it might be worth your while to include those details in your note-taking system. Here’s how I did it in DTPO.
First step for any digitization process is to scan in the printed page and convert the page images into text. In ye olden days you had to do it yourself, and then run it through OCR software. Nowadays it’s more likely that you can download an .epub version of it from Google Books, and then convert it to .txt with the free Calibre software. Worst case, download the PDF version and OCR it yourself.
Now you find yourself with a text document full of historical minutiae. Import the text (and the PDF, just to be safe) into DTPO. Next, add some delimiters which will indicate where to separate the one big file into many little files. But do it smart, with automation. Open the text document in Word (right-click in DTPO or find it in the Finder), and then start your mass find-replace iterations to add delimiters, assuming there’s a pattern that you can use to add delimiters between each volume. Maybe each volume is separated by two paragraph marks in a row, in which case you would add a delimiter like ##### at the end of ^p^p. You’ll end up with something like this:
As you can see, the results are a bit on the dirty side – I’ll see if I can get a student worker to clean up the volume numbers since they’re kinda important, but the main text is good enough to yield search results.
Once you’ve saved the doc in Word and returned to DTPO, you can use the DT forum’s Explode with Delimiter script. Check the resulting hundreds of records – if there’s more than a few errors, erase the newly-created documents, fix the problematic delimiters in the original, and reparse. You’ll want to search not only for false positives, i.e. delimiters added where they shouldn’t have been, but also for false negatives, volumes that should have delimiters but were missed. For example, search ^p^# in Word to check for any new paragraphs starting with a number (assuming the inventory starts each paragraph with the volume number).
But wait, there’s more. Once you’ve parsed those, you can even take it a step further. The summaries aren’t usually at the document level, but there is enough detail that it’s worth parsing the descriptions within each volume. After converting all the parsed txt files to rtf files, move each volume’s document to the appropriate provenance tag/group, and then run another parse on that volume’s record, with the ; as delimiter. In the case above, you might want to also parse by the French open quotation mark, or find-replace the « with ;«. Parsing this volume summary gives you a separate record for each topic within the volume, or at least most of them. With all these new parsed records still selected, convert to rtf and add the provenance info to the Spotlight Comments. Now you’re ready to assign each parsed topic document to whichever topical groups you want.
It’s not perfect, but it’s pretty darn good considering how little effort it requires; maybe an hour or so gets you 500+ volume inventories in separate records. Now you’ve got all those proper nouns in short little documents, ready to search, (auto-)group and sort.