What this historian needs for his 21st-century note-taking system

Back to my continued discussion of Devonthink Pro Office (DTPO). Better late to the party than never, I guess. Or at least I'm hoping so.

I’ll keep this post focused on what historians like myself need for historical research. A future post will provide the gory details about how to get there (or pretty close) with DTPO.

The variety of sources, information, and file formats I use are familiar to most historians:

  • paper and electronic
  • image and text
  • typescript and manuscript
  • secondary and primary
  • bibliographic metadata and content.

On the PC I have the relational database Access with 10,000s of sources, searchable in myriad ways. But I also have computer file folders filled with PDF images of handwritten archival documents, image PDFs of early modern books that can't be OCRed, and freshly scanned photocopies of archival documents, not to mention three-ring binders and hanging file folders full of handwritten notes and thoughts.

Paper sources that aren't already born-digital can be converted to digital form by scanning, photographing, or transcribing. But since I have a large legacy collection (1,000s of images of handwritten documents), and since I am at different stages of note-taking with different documents and collections, Access alone can't meet my needs. My new note-taking system therefore needs to accommodate a variety of documents in various states of processing:

  • some documents have already been transcribed into Access with multiple keywords
  • others consist only of placeholder records with a few keywords and some metadata on the source
  • still others I’ve categorized only by putting the PDFs into year folders (sometimes with a hyperlink within Access)
  • with others I have only the vaguest sense of what they contain, e.g. who they’re from and when they were written.

I thus need to handle all of the following types of info (a rough code sketch of how these might be modeled follows the list):

  • Catalog or inventory information on each archive, archive collection, and archive volume (either from a paper catalog or from online databases). This information is important when first starting your research, or when you need to refer back to something; having it in digital form makes it easily searchable. Two examples:
    Blenheim Papers catalog entry, scan of paper copy

    • Title: Vol. XXIII (ff. 197). 1703-July 1705.
    • Collection Area: Western Manuscripts 
    • Reference: Add MS 61123 
    • Creation Date: 1703-Jul 1705 
    • Extent and Access: Extent: 1 item
    • Language: English, French
    • Contents and Scope: 
      Contents:
      Vol. XXIII (ff. 197). 1703-July 1705.
      includes:

      • ff. 15′, 20 Robert Harley, 1st Earl of Oxford: Letters to A. Maynwaring: 1704, 1712.
      • ff. 15*, 20 Arthur Maynwaring, alias Mainwaring; MP: Letters to: 1704-1712.
      • f. 50 Army of Mecklenburg-Schwerin: Memorial rel. to: 1704.
      • f. 50 [Hermann Christian?] von Wolffradt, Mecklenburg-Schwerin Envoy in London: Memorial by: 1704.: Extract….
  • An entire volume of (usually handwritten) archival documents. Generally these are documents that haven't yet been processed, though ideally they would eventually be transcribed for search. It's much quicker to tag or keyword them, or to group them into folders, e.g. by year or by specific event (the siege of Douai in 1710)… Keeping a copy of the entire volume in its original order and pristine form is preferable. In my pre-existing setup these files are taggable only, in PDF or JPG format.
  • A photographed copy of a specific document. Not searchable (it can't be OCRed and I haven't had time to transcribe it yet), but taggable; in PDF or JPG.
  • A transcript of a specific document. Searchable and taggable, in rich text format or in a memo field in my Access database; DOC or RTF or TXT.
  • Notes on specific documents, often a single quote or sentence (the venerable notecard's "one thought, one note"). Taggable, and searchable unless they're handwritten notes you've scanned in to save time; usually DOC or RTF.
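As promised, here's a minimal sketch of how these item types and their processing states might be modeled, written in Python purely for illustration. The status names, fields, and example records are my own inventions, not features of Access or DTPO:

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class Status(Enum):
    """How far along a document is in the processing pipeline."""
    UNEXAMINED = auto()     # only the vaguest sense of author/date
    FILED_BY_YEAR = auto()  # PDF dropped into a year folder
    PLACEHOLDER = auto()    # stub record: a few keywords + source metadata
    TRANSCRIBED = auto()    # full transcript entered, multiple keywords

@dataclass
class SourceItem:
    """One item: a volume scan, a document photo, a transcript, or a note."""
    title: str
    file_format: str        # "PDF", "JPG", "DOC", "RTF", "TXT"
    searchable: bool        # full-text searchable (transcript or OCR)?
    status: Status = Status.UNEXAMINED
    tags: list[str] = field(default_factory=list)

# A photographed archival document: taggable but not (yet) searchable.
photo = SourceItem("BL Add MS 61123, f. 50", "JPG", searchable=False,
                   status=Status.FILED_BY_YEAR, tags=["1704", "Mecklenburg"])

# A finished transcript: both searchable and taggable.
note = SourceItem("Harley to Maynwaring, 1704", "RTF", searchable=True,
                  status=Status.TRANSCRIBED, tags=["letter", "Harley"])
```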

Relational databases like Access handle some of these complications more easily than DTPO: you can have a huge number of fields and tables, all customizable; each field is treated essentially the same; and they are all batch-editable (via update queries). My basic note-taking form in Access has probably 100 fields all told. This makes it a simple matter to have separate fields for a common document type, say a letter published in a collection of correspondence. It seems simple, but it requires the following information:

  • book title and editor
  • book page number
  • letter number in book
  • archive this letter is taken from (including the archive collection, archive volume, archive folio or number)
  • date of letter (sometimes only a year, or a month and year; and is the date Old Style or New Style?)
  • author of letter
  • recipient of letter
  • place letter written
  • date letter received (OS/NS again)
  • Plus more general info on the document (it's a primary source; genre = letter; language = English…)
  • Plus housekeeping info such as the note’s status (finished? to enter? to keyword?…)

Plus the content of the letter itself:

  • transcript
  • summary
  • paraphrase
  • various keywords associated with the document, including topics and proper nouns mentioned
  • the date the content refers to (which isn’t necessarily the same as the date the letter was written – the date of content is particularly important if you have a siege account that spans several days, or a letter that discusses a week’s worth of activity)

In other words: multiple versions of authors, of citation info, and especially of dates. These fields are in addition to other important system info, like the date the individual database record was created, when it was modified, and many other fields working behind the scenes. Here's my main Access entry form that records all this info:

Notes database, 2012 version
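For readers who want something more concrete than a screenshot, here's a rough sketch of that letter record as a pair of tables, using SQLite from Python purely for illustration. The table and field names are hypothetical stand-ins, not my actual Access schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One row per note/document, with the citation, letter, and content fields
# described above, plus housekeeping fields.
conn.execute("""
    CREATE TABLE notes (
        id               INTEGER PRIMARY KEY,
        created          TEXT DEFAULT CURRENT_TIMESTAMP,  -- record created
        modified         TEXT,                            -- record modified
        -- citation: the published collection
        book_title       TEXT,
        book_editor      TEXT,
        book_page        INTEGER,
        letter_number    INTEGER,
        -- citation: the underlying archive
        archive          TEXT,   -- e.g. 'BL Add MS 61123'
        archive_folio    TEXT,
        -- the letter itself
        letter_date      TEXT,   -- may be partial: '1704' or '1704-07'
        date_style       TEXT,   -- 'OS' or 'NS'
        author           TEXT,
        author_nation    TEXT,   -- supports nationality filters (see below)
        recipient        TEXT,
        recipient_nation TEXT,
        place_written    TEXT,
        date_received    TEXT,
        -- general document info and housekeeping
        source_type      TEXT,   -- 'primary' / 'secondary'
        genre            TEXT,   -- 'letter', 'memoir', ...
        language         TEXT,
        note_status      TEXT,   -- 'finished', 'to enter', 'to keyword'
        -- content
        transcript       TEXT,
        summary          TEXT,
        paraphrase       TEXT,
        content_date     TEXT    -- date discussed, not date written
    )
""")

# Keywords are many-to-many with documents, hence a separate table
# (one row per keyword per document) in a relational design.
conn.execute("""
    CREATE TABLE keywords (
        note_id INTEGER REFERENCES notes(id),
        keyword TEXT
    )
""")
```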

You can then search and sort by all of these fields: find me every document written by a Dutchman to an Englishman, before 1708, that mentions the word "decisive" in relation to either Blenheim or Ramillies, sorted chronologically. You can quickly perform calculations: how many documents were written in each year meeting the above criteria? You can perform batch find-and-replace operations, assign the values of certain fields based on the values of other fields, and send the results to a report for export. It's incredibly flexible because you build the database from the ground up, starting from scratch. Much of the power of relational databases for us humanists derives from their under-appreciated ability to attach bazillions of short keywords (drawn from lookup lists). Yet keywording all of these documents by hand takes a lot of time, too much if you have thousands. And the database is getting rather sluggish with 40,000+ records. And it can't search through all that text very quickly. And it's stuck on my aging PC desktop.
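Continuing the hypothetical schema from the sketch above, those operations might look something like this (illustrative SQL, not my actual Access queries):

```python
# Every document from a Dutch author to an English recipient, before 1708,
# mentioning "decisive" and keyworded Blenheim or Ramillies, chronologically.
rows = conn.execute("""
    SELECT n.letter_date, n.author, n.recipient
    FROM notes n
    JOIN keywords k ON k.note_id = n.id
    WHERE n.author_nation = 'Dutch'
      AND n.recipient_nation = 'English'
      AND n.letter_date < '1708'
      AND n.transcript LIKE '%decisive%'
      AND k.keyword IN ('Blenheim', 'Ramillies')
    ORDER BY n.letter_date
""").fetchall()

# The quick calculation: how many matching documents per year?
per_year = conn.execute("""
    SELECT substr(n.letter_date, 1, 4) AS year, COUNT(DISTINCT n.id)
    FROM notes n
    JOIN keywords k ON k.note_id = n.id
    WHERE n.transcript LIKE '%decisive%'
      AND k.keyword IN ('Blenheim', 'Ramillies')
    GROUP BY year
""").fetchall()

# A batch edit in the spirit of an Access update query: fill one field
# from the value of another.
conn.execute("UPDATE notes SET content_date = letter_date WHERE content_date IS NULL")
```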

Devonthink, on the other hand, is optimized to work with 10,000s of very long full-text documents: 31 million words in my main database at the moment. But because DT is more of a freeform text database than a relational database, it doesn't deal with lots of keywords or metadata very well. Some DT users avoid the frontloaded overhead of tags/keywords altogether and just throw all their textual documents into DT and use its search and Classify/See Also features (described in a future post). That strategy is quick and useful, and it may be the best option for people who haven't worked much with database design and software optimization. But such a strategy (a) won't work for non-textual documents, of which I have thousands, and (b) won't allow filtered searches that use other metadata as limiters. I must use keywords and metadata, and I'll bet I'm not the only one. Yet the ability of DT's AI to automatically classify documents by topic is one of the big reasons why I switched to DT – I was overwhelmed by how much reading and manual keywording I'd have to do in my current Access database. So my DT system has to be a bit more complicated. How complicated? Let's save that for another post.
