Two months without a Devonthink post?!?

My Devonthink posts, marginally related to EMEMH proper I admit, continue to attract readers, and even comments/questions. So an update is in order.

To frame the discussion, here’s a reminder of what any note-taking setup should allow you to do:

  1. Collect: Amass all your documents, notes, thoughts, writings and images into a single interface. The wider the variety of files you can view, the better. The wider the variety of documents you can search with the same keywording scheme, the better.
  2. Summarize: Distinguish original documents from your notes on those originals. Provide a way to connect summaries to the originals.
  3. Sort: Organize your data in sortable lists, usually using metadata fields (date, author, place…) as the sorting variable.
  4. Search: Find specific documents quickly via title, author, and other metadata, or by navigating through a folder hierarchy. Find specific text strings within any documents. More powerful software (like DTPO) will allow proximity and fuzzy searching.
  5. Cite: Provide a place to store bibliographic information about each source, so that it can be cited when used.

[Reminder: I haven’t fully implemented all of these features yet, so some of the screenshots may show an outdated setup. Still a work in progress.]

COLLECT
Thus far the biggest advantage of DTPO for me is a rather mundane one: collecting all my sources (images, texts) in one place, in one program that can display them all, and with a consistent metadata scheme I can use to find specific sources (or categories of sources) quickly. Some statistics illustrate the scope of my needs. My main database, descriptively-entitled ‘WSS’, is a behemoth. For those interested in the details:

My main DTPO database’s properties

I have over 26,000 individual documents, totaling 80 GB of space. As you can see, 15,000 of these are image-only PDFs, which means their content is not searchable, though their file names and metadata are. Nonetheless, merely storing the thousands of PDFs in DTPO is a big improvement over my prior system of nested file folders. I can now also import all my images into DTPO (actually I try to convert them to PDFs to take advantage of that file format’s expanded metadata). Thus all my documents on the 1708 siege of Lille are in the same place (in the Lille 1708 siege group, duh), whether they be siege plans, siege accounts, or my notes and questions about the siege and its sources… Previously I had to go to Picasa to search for images, separate from my texts.

Screenshot of Lille 1708 group (obviously I have far more on Lille, just not in the right group yet)

I’ve named my groups so that the ‘Lille1708’ group already has the year in its name (so a saved search looking for *1708 will find it), and the Lille1708 group is nested within the ‘SiegesLC’ group, so a search for LC (Low Countries) will pull it up, along with records from the ‘BattlesLC’ group, and so on.

Ideally many of these manuscript PDFs will eventually be transcribed to take advantage of DTPO’s AI – since they are handwritten, OCR won’t work. It is easy enough to transcribe an image file of a manuscript document into a rich text file, with two windows open. It also helps if you have a big monitor.

Transcribing with DTPO

But all this transcription will take TONS of time, and I need to produce scholarship in the meantime. Therefore I’ve chosen to import everything I can into DTPO as quickly as possible – putting the originals in tags, and keeping my processed, or at least parsed, content for the groups (see my recent caveat though). The content of full-text documents (stored in the tags) is fully searchable, which is a major improvement on its own, and you can also take advantage of the AI’s See Also and Classify features with these documents. You can automate this searching by adding saved searches (“smart groups”) inside any group or tag, for example one that will list all textual documents dated 1708 that include a text string such as “Lille.” ‘One thought, one note’ makes a huge difference here.
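If you’d rather script a search like that than click through the smart group dialog, something along these lines should work. This is only a rough sketch, assuming DEVONthink’s AppleScript search command and my own database name (“WSS”); the year filter piggybacks on the year-in-the-name convention described above:

    -- Rough sketch: approximate a "Lille in 1708" smart group from AppleScript.
    -- Assumes DEVONthink Pro's scripting dictionary (the "search" command, a
    -- database's "root", and record "name").
    tell application id "DNtp"
        set theHits to search "Lille" in (root of database "WSS")
        set theMatches to {}
        repeat with theRecord in theHits
            -- crude date filter: keep records whose name carries the 1708 date
            if (name of theRecord) contains "1708" then set end of theMatches to theRecord
        end repeat
        return theMatches
    end tell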

For the image-only PDFs, however, I have to rely on the metadata. I generally add enough metadata and tags/grouping to narrow down the results I need to wade through. As a first step, I’ve been parsing each archive PDF (consisting of several hundred archive folios) into separate years. This simple parsing makes a huge difference, since one of my main units of analysis is a combat event, which generally occurs within a specific campaign year. Since most individuals served in only one theater each year, the geographical variable is isolated as well.

Oh Albemarle!

As you can see in the above screenshot, I also pasted in the British Library MS online catalog entry as a separate document, shown in the floating pop-up window. This is really useful because it provides reference info on a selection of letters, which will appear if I search for a minor figure like “Ittersum.”

You could add a link in the catalog record doc to jump straight to the image PDF, or else Reveal the document in its containing group. If you were really bored, you could create a link from each catalog entry to the specific page in the image PDF where that document starts. Generally, you can avoid some hyperlinking monotony by simply selecting some text, then right-clicking to run a search based off your selection. You could also turn this into a key command, obviously.
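For what it’s worth, the cross-link doesn’t have to be built entirely by hand. Here’s a minimal sketch, assuming DEVONthink’s reference URL property (which returns a record’s x-devonthink-item:// link); it copies the selected image PDF’s item link to the clipboard, ready to be pasted into the catalog-entry document:

    -- Sketch: copy the selected record's DEVONthink item link to the clipboard.
    tell application id "DNtp"
        set theSelection to the selection
        if theSelection is {} then error "Select the image PDF first."
        set thePDF to item 1 of theSelection
        set the clipboard to (reference URL of thePDF as string)
    end tell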

The unique word count in the WSS database is quite high at 801,000: foreign language words in Anglophone scholars’ publications (including titles in the footnotes and bibliographies) account for some of this, but it also results from the irregular spelling of early modern English, as well as dirty OCR. As an early modern Europeanist, you can’t really avoid any of these, which presents its own problems for DTPO.

Overall, DTPO has made a huge difference as far as collecting my documents in a single place.

SUMMARIZE
Technically, the majority of the documents enumerated above are not, in fact, notes. They are copies of the original primary and secondary sources, rather than notes that process those originals in some way, whether through a summary, a paraphrase, an excerpted quote, a keyword, or a comment/thought I’ve added. In one sense the full-text versions are less useful than notes based on them, because I haven’t condensed the germane information within them, which means there is a massive amount of text to conceivably wade through. On the other hand, half the point of having the originals (in DTPO or anywhere else) is that you’ll invariably return to the sources with different questions over time, or to look for more context for your fragmentary notes. So too with the image PDFs: in one sense they are even less useful than the full-text originals, since you can’t even search their content; but they are still much more useful than having them buried eight levels deep in a folder hierarchy. More to the point, notes have their own disadvantages: I’m sure I’m not the only one who’s had occasion to rue the day I paraphrased when I should have quoted, nor the only one to be thankful that I photographed documents in the archives rather than just taking notes. And, let’s be honest, I have copies of a few hundred thousand letters, a hundred-plus treatises, etc. – there’s no way I’ll ever be able to go through every one of them. Welcome to the historical profession.

I’ve tried to mitigate the problems associated with a predominance of originals over notes by:

  1. Creating a system that will allow me to, over time, take notes on the originals.
  2. Taking advantage, in the meantime, of full-text search capabilities for OCRed published sources.
  3. Assigning metadata to progressively narrow down my search for image PDFs, even if the search isn’t as precise as I’d like it to be.

As described in previous posts (check the Devonthink blog tag), I keep the originals in DT tags, and my notes in the DT groups, where the AI works. If there are image PDFs that are specific to a single group, I can put those in the group regardless of their language or whether they even include text – non-text PDFs shouldn’t interfere with the AI (though See Also and Classify oddly suggest ‘similar’ items for PDFs that have no words, which you wouldn’t think possible).

In most cases I like to keep a non-OCRed image PDF of an original source in my WSS database, as well as put a full-text OCRed version (say, of a French treatise) in a separate French database, with a link from the one database to the other. This way I can keep all of my originals at hand: the full text of the original is available for string searches, while the original scanned images are also available in case of (inevitable) OCR issues.
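That database-to-database link can also be scripted. Here’s a hedged sketch, assuming DEVONthink’s URL and reference URL record properties, and assuming you can get both records selected in one view (a multi-database smart group works); it stashes the scan’s item link in the transcription’s URL field so the jump back is one click away:

    -- Sketch: select the image scan first, then the full-text transcription.
    tell application id "DNtp"
        set theSelection to the selection
        if (count of theSelection) is not 2 then error "Select the scan and the transcription."
        set theScan to item 1 of theSelection
        set theTranscription to item 2 of theSelection
        -- store the scan's x-devonthink-item link in the transcription's URL field
        set URL of theTranscription to (reference URL of theScan)
    end tell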

The overall organization can be summarized thus:

DTPO notes system (revised)

TAGS for each source (nested within genre and source type). Each source tag contains the image PDF of the original, the full text of the original if it is in English, and a catalog document or other summary description of the document. There’s a link from the catalog/summary doc to the original, as well as from original full-text to original scan (in case I need to check something).

GROUPS for substantive thematic content. The Artillery-Field Train group – nested in the Artillery group – can have note documents from multiple sources, linked back to the page of the original PDF as needed. You can include the full-text of originals if you want (in English only), as long as they are limited to that topic. The group also includes docs with my thoughts about the subject, as well as any image PDFs of illustrations, maps, even parsed original PDFs if you don’t have time to transcribe everything.

You can also include an index document of sorts, listing all of the sources available on this topic, with links to those documents. DT’s Scripting forum has code to automate the creation of such an index document. This partly duplicates what you can see in the group’s list view, but you can be more selective in your index.
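If you don’t want to dig up the forum script, a bare-bones version of the same idea might look like the following – a sketch only, assuming DEVONthink’s current group, children, reference URL, and create record with commands. It writes one line per document (name plus item link) into a plain-text index note:

    -- Sketch: build a simple linked index of the frontmost group's children.
    tell application id "DNtp"
        set theGroup to current group
        set theLines to {}
        repeat with theChild in (children of theGroup)
            set end of theLines to (name of theChild) & tab & (reference URL of theChild)
        end repeat
        set AppleScript's text item delimiters to return
        set theBody to theLines as string
        set AppleScript's text item delimiters to ""
        create record with {name:"Index - " & (name of theGroup), type:txt, content:theBody} in theGroup
    end tell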

SEPARATE DATABASES for each language (i.e. for the full-text sources), to use DTPO’s AI. If you’re like most humanities scholars, and particularly EMEMHians who study subjects who fought in multinational coalition armies, you’re relying on lots of multi-language, full-text, published sources, e.g. Marlborough’s Letters and Dispatches have letters in English and French, and I think some Latin and German too. You wouldn’t be able to use the AI on such a confusing collection, so you need to segregate your text documents into separate language databases – the French-language sources in a French database, just like the English-language sources in the WSS database – maybe even a (temporary) separate database for multi-language sources like the above-mentioned Letters and Dispatches. However, separate databases are very unnatural when you, for example, summarize part of a document in your native tongue (e.g. English) but include an important quote in the original language. My compromise is to keep as many of the notes in the main database as possible – for multi-language documents this means providing summaries in English, translating key quotes into English (making sure to indicate that it’s my translation, which I try to keep as awkwardly literal as possible to preserve key terms), and then providing a hyperlink to the original, transcribed document in the French database. Not ideal, but I’m hoping some Applescript will automate a lot of this (not the translation part, of course).

English translation of original French document
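The summary-stub part of that workflow is exactly the sort of thing a short script could set up. A sketch under assumptions – DEVONthink’s incoming group, reference URL, and create record with commands; the “WSS” database name is from this post, and you’d refile the stub into the right group afterwards:

    -- Sketch: for the selected French original, create an English summary stub
    -- in the WSS database that already carries a link back to the transcription.
    tell application id "DNtp"
        set theOriginal to item 1 of (the selection)
        set theStubText to "Summary (my translation):" & return & return & ¬
            "Original: " & (reference URL of theOriginal)
        create record with {name:"EN summary - " & (name of theOriginal), type:txt, content:theStubText} in (incoming group of database "WSS")
    end tell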

Overall, my general structure attempts to take advantage of full-text search across multiple languages, and make it easier to find non-transcribed sources quickly.

SORT
Another key purpose of any note-taking system is to make it easy to sort the notes. This is automatically available in the group list view with the use of metadata. You can choose which columns to display, and how you’d like to sort them. You can do the same with search results, which is key, because you often need to sort your search results before you can use them.

I still haven’t settled on the perfect combination of naming conventions and metadata fields, but generally I try to keep the Spotlight Comment field for basic title info, I name the document based on its content when I can, I begin one of the fields with a date in YYYY.MM.DD format for sorting purposes, and I identify the type of document for granularity in my searches. Document type possibilities, which I store in the Keyword file metadata field, include:

  • orig: document is a copy of the original (either full-text or scanned).
  • note: document is a note on an original (either summary, paraphrase, or quote).
  • thought: document is a thought of mine on the subject or source.
  • link: document is a link to the original, usually in a different database, or to a website.
  • map: document is a map or plan – it needs to be converted to PDF to use the Keyword metadata field to note this.
  • pic: document is a contemporary illustration/artifact.
  • draft: document is a draft, either mine or from a friend.
  • graphic: document is a timeline, diagram, etc. – i.e. not a contemporary image.
  • index: document is an index, or linked list to other documents on the topic.
  • summary: document is a summary of the source, including citation, bibliographic info…

DTPO Misc Iberian Ops group

There are various other metadata fields that you can use, but since they all need to be added by hand for each record, I’m not sure yet how useful they are. I have, however, tweaked a number of the DT-provided Applescripts (ok, my programming wife has, while I watched) to do things like copy the file title to the Spotlight Comment, so I can then rename the file with a title that describes the content of the document rather than its provenance. (Remember to include this title summary in the content of your document if you want the AI to use it.) You can also use the columns to sort your search results in the advanced Search window.
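For the curious, the core of that title-to-Comment tweak is only a few lines. This is a sketch (not DEVONtechnologies’ original script), using DEVONthink’s comment property, which maps to the Spotlight/Finder comment:

    -- Sketch: push each selected record's current name into its Spotlight Comment,
    -- freeing up the name field for a content-describing title.
    tell application id "DNtp"
        repeat with theRecord in (the selection)
            set comment of theRecord to (name of theRecord)
        end repeat
    end tell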

A new advance: I finally figured out how to replicate my old Access report, which lists each document on topic X, sorted by date, regardless of language. In DT this requires a smart group (saved search) that searches across multiple databases:

By Day DTPO smart group

My old Access report, for comparison:

By Day Access report on Hulst 1702

On the one hand, DT’s smart group doesn’t display the content of each document on the same page (i.e. at the same time, for comparison). On the other hand, descriptively-named documents will work even better in list format, once I’ve named them all. And as you’ve already seen, you can always open each document in a separate pop-up window if you’d like. So that’s one more Access feature I’ve been able to replicate in DT.
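If the smart group ever feels limiting, the same cross-database gathering can be scripted. A rough sketch, with several assumptions flagged: DEVONthink’s search command, my “WSS” and “French” database names, a hypothetical “Hulst” query, and the YYYY.MM.DD naming convention (which makes a plain alphabetical sort chronological):

    -- Sketch: gather "Hulst" hits from multiple language databases into one list.
    tell application id "DNtp"
        set theNames to {}
        repeat with theDBName in {"WSS", "French"}
            set theHits to search "Hulst" in (root of database (theDBName as string))
            repeat with theRecord in theHits
                set end of theNames to name of theRecord
            end repeat
        end repeat
    end tell
    -- sort theNames with any stock AppleScript sort routine; the YYYY.MM.DD
    -- prefixes then put the documents in chronological order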

SEARCH
DT’s search capabilities have already been addressed in a number of contexts: fuzzy search, proximity search, wildcards, Similar Words, See Also and Classify… At this stage DT doesn’t have a lot of flexibility in its metadata, which is often an under-appreciated aspect of note-taking, but it’s adequate for most purposes.

So here I’ll just add some statistics to give you a sense of the speed of DTPO. Speed can have a qualitative impact on your research, especially as you multiply the number of searches you perform. Despite 80 GB and 27,000 documents, DTPO runs fine on my 4GB 1600MHz MacBook Air (compared with my much more powerful 24GB RAM iMac desktop). In fact, it’s usually faster for me to simply search for a keyword in the title of the document I want than to navigate through the nested hierarchies. Searching the 59 million words in the 8,755 text files in the database is relatively quick. To compare with a search (of all the databases on my main desktop) that I did in a previous post:

Search                                  iMac hits   iMac time    MBA hits   MBA time
“I do not know”                         92          0.234 sec    129        1.722 sec
“I do not know” (fuzzy)                 94          0.27 sec     131        1.278 sec
“do not” within 10 words of “know”      432         0.36 sec     509        0.09 sec
“do not” before “know”                  379         0.18 sec     730        0.034 sec

As an aside, large text files require much more processing power: opening a 5 MB text file may take a few seconds, whereas opening a 100 MB image file happens almost instantly. So if I had 15,000 RTFs instead of 8,000, the searches would take a bit longer. Nevertheless, it’s pretty speedy.

CITE
As I’ve mentioned before, DTPO doesn’t have any capabilities akin to Zotero’s, which automates downloading metadata for publications, although there is some Applescript out there that could import Zotero records into DT (not sure if it still works or not). In any case, you can add a properly-formatted citation for each source in the catalog/summary doc in each tag. You can then quickly find it with the “summary” Keyword field. I still need to figure out how best to import all my secondary-source data from my old Access database.

CONCLUSION
So overall I’m happy with the transition, even though I know there are certain potential projects that won’t be possible with DTPO, particularly statistical analysis of correspondence, or tracking all the details of correspondence with metadata alone. DT’s focus on textual content is clearly optimized for thematic topics. All in all, though, DTPO makes it much easier to find most things than my previous system did, and I can view PDFs, image files (which I try to convert to PDFs), text files, etc. all in the same interface, and with consistent metadata.

It’s a good start, and the possibility of further enhancements via Applescript augurs well for the future. So I guess that’s my New Year’s resolution.


2 responses to “Two months without a Devonthink post?!?”

  1. jostwald says :

    For those interested in easily copying just the structure of your database to a new one, i.e. replicating the Group structure from one database (say the WSS db) to another (say the French db), the DT Scripting forum has the script here: http://forum.devontechnologies.com/viewtopic.php?f=20&t=11596
    You’ll want to wait until your group structure is pretty well set, or else you’ll need to worry about version control issues between the various databases.
    I’d also suggest you use the same group names in all the databases (e.g. all the group names in English), if you want to make your cross-database searches easy.

  2. jostwald says :

    Should’ve also mentioned that there should be a link from every original full-text document (e.g., in the French database) back to the original image PDF (in the WSS database). That way every full-text source has a quick jump back to the original. [1/8/14: Revised diagram above]
