Organizing with Devonthink
Yet more Devonthink.
The advantages of Devonthink Pro Office for me compared to my old Access database are straightforward:
- DTPO collects all my research documents in one place, making it easy to view them all together. This includes PDFs (both image and OCRed), jpgs, Word docs, rich text docs, Excel files, web pages (you can easily clip items from your web browser), emails, RSS feeds (in fact, Skulking posts get automatically fed into DTPO), not to mention many other file types. It’s so liberating to simply press the down arrow key to shift from viewing an image PDF to reading a text doc, rather than hunting in the file viewer and waiting for Word, Adobe Acrobat, Photoshop… to open. You can also open up multiple windows, for example, to transcribe a manuscript or compare documents. Ideally I’d have all of my sources and notes already in text format, but that’ll never happen: 6,661 text files currently in DTPO, but 15,596 image PDFs in my main database, 68 GB overall. (FWIW you can achieve some of this with OS X’s Preview/Quick Look.)
- DTPO allows me to group and tag documents regardless of their file format. I can search these varied documents with a consistent syntax within a single interface. Previously, I had an excellent keywording system in my Access database, but it only included text documents, not the thousands of PDFs I had in various folders. (FWIW you can achieve some of this functionality by simply tagging your individual files and using Spotlight searches to find them.)
- DTPO allows for rapid, robust text searches, including wildcards, nested terms, proximity searches, and fuzzy searches. Its AI (Classify, See Also, Auto Group) will also identify textual documents that share similar word patterns, which eliminates much of the need to manually categorize sources. Related words are also identified with its Similar Words command. These search features are critical to my DTPO usage.
- DTPO allows for easy hyperlinking (or wikilinking if you prefer) between RTF and other documents, so you can link notes to original sources, textual notes to the PDF documents they sprang from, etc. Your text note can include a hyperlink that will whisk you to the exact PDF page of the original document. A shortcut or back arrow will take you right back to the note.
- DTPO allows easy syncing between my iMac desktop and MacBook Air laptop, so that all of my sources and notes and thoughts are available wherever I go. There’s even an iPad version of DT, although it’s received mixed reviews.
Less important advantages of Devonthink Pro Office for me (i.e. things I could already do, or probably won’t do much in DT but the capability exists):
- DTPO converts file formats (e.g. from PDF to rich text, Word to rich text…).
- DT’s Pro Office version comes with ABBYY FineReader to OCR imported documents. I usually use the full FineReader 11 Pro for PC because I want more control over the results.
- DTPO can run a web server so you can share your databases online.
- DTPO runs AppleScripts and Perl scripts, and supposedly Python scripts can be run from within AppleScripts. It’s something I might look into in the far future, assuming I ever learn Python.
So much for the main advantages.
But this is old hat if you’ve read previous posts on DT. If you aren’t a computer power user, follow the straightforward DT organizations described on other blogs. They’re simple setups, even if they don’t take advantage of some of DT’s more powerful features, i.e. the AI. The rest of this post provides a more powerful (and complicated) variation on how to make efficient use of DT’s myriad organizational structures for historical research. If you’re not familiar with DT already and you want to be, go read some of the previous posts on the software, check out the forum on the Devonthink website, and buy the Take Control of Getting Started with Devonthink 2 book. Reading through the PDF manual wouldn’t hurt either. If you can’t or won’t RTFM, then you should stop reading now and stick with what you know.
Organizing in Devonthink
Although I can’t fully replicate all of the functionality of Access, I have distributed the most important pieces of document metadata among the various DT “organizing metaphors” (I have no idea what to call these thingies). At a minimum, good historical notes need a document name, its content and/or notes on the document, the source information on the document (where it came from or its provenance), and a way to attach keywords. DT allows you to organize all this information using five basic attributes of each document: two hierarchical structures native to DT, Groups and Tags; document name conventions; the operating system’s default metadata fields; and freeform keyword text strings within the content. You can also combine naming conventions with metadata fields and Groups and Tags, e.g. add a prefix to tags indicating that they are subjects (su:Siege), or people (au:Marlborough)…
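To make the prefix convention concrete, here is a minimal Python sketch of how prefixed tags could be sorted back into categories outside of DT, say on an exported tag list. The prefixes su and au come from the examples above; the function name and the "unprefixed" bucket are my own assumptions, not anything DT provides.

```python
from collections import defaultdict

# "su" and "au" are the prefixes used in the examples above;
# the labels they map to are illustrative.
PREFIX_LABELS = {"su": "subject", "au": "person"}

def split_prefixed_tags(tags):
    """Sort tags like 'su:Siege' into buckets keyed by what the prefix means."""
    buckets = defaultdict(list)
    for tag in tags:
        prefix, sep, value = tag.partition(":")
        if sep and prefix in PREFIX_LABELS:
            buckets[PREFIX_LABELS[prefix]].append(value)
        else:
            # Tags without a known prefix (e.g. provenance tags) stay whole.
            buckets["unprefixed"].append(tag)
    return dict(buckets)

result = split_prefixed_tags(["su:Siege", "au:Marlborough", "Add61123"])
print(result)
```

The point of the prefix is simply that one flat tag string carries two pieces of information: what kind of tag it is, and its value.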
The most noteworthy aspects of DT’s organization are its Groups and Tags. Groups are simply DT “folders” where you place related documents. Groups are used by DT’s Classify feature to suggest a category for a selected document; in essence Classify says the current document’s content is similar to the content of documents in the pre-existing Group X (and Y and Z). You can quickly use the Classify feature to move imported documents from the Inbox to their appropriate groups, using Classify’s Move To command. Auto Classify automates categorization into pre-existing groups if you trust the AI – it’s usually pretty good. In short, the Classify command gives you a quick way to assign a bunch of diverse documents to groups you’ve already created.
There is a related AI feature called See Also, which works on the document rather than group level – it compares the selected document not against the documents according to their groupings, but against all of the documents in your database. It also differs from Classify in that it does not move the selected document anywhere. See Also says the current document is similar to this list of other documents. This means that See Also relies completely on the AI to detect patterns without your input. The Auto Group command operates on the same principle, creating new Groups and moving the selected documents to them. With Classify, on the other hand, the AI looks for similar documents based off of the groupings you’ve defined as meaningful, and allows you to quickly move them to a group.
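The difference between the two features is easier to see in a toy version. DT doesn't disclose how its AI works, so the sketch below is emphatically not DT's algorithm: it swaps in plain word counts and cosine similarity as a stand-in, and every function name and sample text is made up for illustration. What it does show is the structural difference: see_also compares a document against every other document, while classify compares it against the groups you have already built.

```python
import math
import re
from collections import Counter

def word_counts(text):
    """Count lowercase alphanumeric words in a text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def see_also(doc, database):
    """Rank individual documents by similarity; groupings are ignored."""
    target = word_counts(doc)
    scored = [(cosine(target, word_counts(text)), name) for name, text in database.items()]
    return [name for _, name in sorted(scored, reverse=True)]

def classify(doc, groups):
    """Rank groups by similarity to the pooled text of their members."""
    target = word_counts(doc)
    scored = []
    for group, texts in groups.items():
        pooled = Counter()
        for text in texts:
            pooled.update(word_counts(text))
        scored.append((cosine(target, pooled), group))
    return [group for _, group in sorted(scored, reverse=True)]

# Hypothetical notes, grouped by topic as recommended in this post.
groups = {
    "Siege tactics": {
        "breach note": "the garrison repaired the breach in the wall",
        "trenches note": "trenches were opened before the fortress",
    },
    "Domestic politics": {
        "ministry note": "parliament voted on the new ministry",
        "election note": "the election divided the ministry",
    },
}
new_doc = "the besiegers opened trenches and battered a breach"

# Classify sees the groups; See Also sees only a flat pile of documents.
group_ranking = classify(new_doc, {g: list(docs.values()) for g, docs in groups.items()})
flat_database = {name: text for docs in groups.values() for name, text in docs.items()}
doc_ranking = see_also(new_doc, flat_database)
print(group_ranking[0], "|", doc_ranking[0])
```

In this toy version the groupings you define act as a hint: classify can only ever suggest a group that already exists, whereas see_also returns a ranked list you have to read through yourself.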
Both Classify and See Also are useful and you can use both at the same time. Personally, I want more control over which documents get grouped together and I want to quickly jump to a list of all the documents on topic X, rather than plow through an ad hoc list of documents that might (but might not) relate to topic X. This means I definitely want to use Classify, which means I need to use Groups effectively. See Also helps after you’ve already classified some documents, to see possible connections to other documents that you might have overlooked.
So let’s see how this relates to organizing your DT database. Most historical users of DT (at least those that blog) use Groups to record provenance, mimicking the file folder hierarchy where we used to store our documents. This has been indirectly encouraged by the DT developers, who repeat the “whatever works for you” line (you know how much I hate that), and who insist that Tags and Groups are essentially the same thing. DT may indeed treat them similarly behind the scenes, and they can both be organized into multi-level hierarchies.
But there’s one huge difference between the two: the Classify AI only works on Groups, not Tags. That means that we should organize our information by whichever Groups we want the AI’s help with. So we should make our Groups thematic if we want the AI to help us organize our documents thematically. But wait – don’t we naturally think of tags as being topical? Yes, and therein lies the problem. To get the most from DT, you have to upend the conventional thinking about groups and tags: Groups are for topics, Tags are for provenance. Not the other way around. Why? Because you want the AI to tell you, “Here are other documents on siege tactics,” i.e. this document’s content shares similar word patterns to the content of documents in the Siege tactics group. If you use a Siege tactics tag, you can search for that tag, but using topical tags won’t help you find more siege-related documents with the AI. And using Groups to record the source of your documents cripples the Classify AI. Classifying by provenance, e.g. archive volume, makes little sense: it does no good to learn that document X is similar to the documents in the group Additional MSS 61334. First, you already know the provenance of every document (you can’t cite it otherwise), and second, the documents in most archive volumes and source collections encompass a wide variety of topics, so the AI won’t be able to find a meaningful pattern when one document is a letter about domestic politics and another details siege tactics. Sure, See Also still works on a document-level, but you have to wade through what the computer sees as patterns, rather than give it a hint by categorizing some of the documents on your own. And if you need to repeat the search again, you’ll have to wade through all those false hits one more time. Use groups for themes, topics and subjects.
Why then do all the historians in the blogosphere do the opposite? As far as I can guess, provenance Groups are so popular because:
- Groups can have hierarchies and so look just like our familiar nested provenance file folder hierarchies.
- Tags didn’t exist in the earlier version of DT so early adopters were stuck with Groups.
- When Tags were first incorporated into DT, the blogosphere wasn’t aware that Tags can also be hierarchical.
- See Also does provide AI capability, and that’s still better than most other available software.
Since Tags have hierarchies, and that’s really all you need for provenance, use Tags for provenance. Since you presumably don’t want to slog through a list of similar documents (found via See Also) more than once, use Classify to group these related documents into more permanent thematic categories, and make it simple to move new documents to the appropriate group or groups (Move To command, or Replicate). You can still use See Also on any documents you want. Tags are also good for provenance info because they can be quickly batch edited, unlike all the metadata fields (excepting Spotlight Comments) – this makes importing much easier. And Tags are flexible as well: any specific document can have any number of Tags (even Tags residing in different hierarchies) – though for that matter, any document can be in multiple Groups too: technically you duplicate/replicate a document to do so.
To repeat: Use provenance Tags as a parallel yet intersecting hierarchy with topical Groups. If you have the group list pane organized by type, the Tags will sort at the top and the Groups underneath.
Tags, like Groups, also have other advantages over the other organizational techniques. Hierarchical nesting also allows you to kill several tagging tasks with one stone: assign a document the tag “Add61123” or the group “Marlborough as Diplomat” and DT will automatically add their respective parent tags and parent groups. Pittis’ Two Campaigns is not only a poem, but it’s also published, and a primary source to boot.
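The inheritance can be sketched as a simple walk up a parent table. This is only an illustration of the idea, not how DT stores its tags, and the exact nesting below (Poem under Published under Primary source) is my assumption based on the Pittis example.

```python
# Hypothetical hierarchy: each tag maps to its parent (None marks a top-level tag).
PARENT = {
    "Poem": "Published",
    "Published": "Primary source",
    "Primary source": None,
}

def with_ancestors(tag, parent=PARENT):
    """Return the tag followed by every ancestor up to the top level."""
    chain = []
    while tag is not None:
        chain.append(tag)
        tag = parent.get(tag)
    return chain

tags = with_ancestors("Poem")
print(tags)
```

Assigning the single most specific tag is enough; everything above it in the hierarchy comes along for free.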
So now you can interpret this screenshot:
As for metadata in DT, they’re not as useful as you might think, certainly not as useful as they could be. DT uses the OS’s default metadata, which means that each file type has different metadata fields available. This is one reason why the best default file type for DT is rich text, because it has seven usable metadata fields plus the Spotlight Comments field. (The email file type has ideal metadata fields for correspondence, but you can’t create email documents within DT, only import them from an email program.) Unfortunately, if you have a lot of keyword fields and want to use more than just RTF files, you can really only use the fields that are shared by every file type you use – unless you want to search, e.g. date of publication, in two different fields depending on file type. This limits you to only a couple of fields. Metadata fields are also a pain because they are external to DT – I can’t figure out an easy way to batch edit RTF fields, either inside or outside of DT. Conceptually, metadata fields should include permanent information about the document as a whole – metadata fields would be the perfect place for provenance information. Metadata fields would also be perfect for sorting in the document list pane. But without batch editability that means manual entry, which is impractical with tens of thousands of records. If I could quickly and easily change multiple documents’ metadata fields, however, I’d consider dropping my ‘tags are for provenance’ line – though the ‘groups are for topics’ line still holds. But at this stage DT doesn’t have a flexible enough approach regarding metadata to take advantage of its potential.
Naming conventions are another strategy to attach information to documents. You can add prefixes or suffixes, which I half-heartedly do (see dates below). Another venerable technique is to title a document with its source or provenance, making it easy to skim through a folder list of file names and locate the proper one. Some historians follow the same concept in DT.
My problem with putting the provenance info in the file name is that it’s a waste of the name field, i.e. it doesn’t tell you anything substantive about the content of the document, only where the document came from. Assuming you use a variety of sources, it’s impossible to remember the content of every archive volume, assuming they even cover a single topic. Provenance naming conventions are particularly problematic because when you search for documents or when you look at the Classify/See Also drawer, the main result you are presented with is the file name. You can see this by looking at the See Also drawer in the screenshot above – a long list of document names that only tell you where the documents come from, not what they actually say. So you have to click on each of them in turn and read their content just to see whether they are relevant or not. That takes a lot of time.
If you haven’t gone through all those image PDFs of your archival originals, you may not be able to name your documents with a substantive title. But as you take notes, you should be applying one of the most important lessons of note-taking: when trying to identify relevant documents, summaries are quicker to search through than reading through the content. Compare the titles in the following search results to see the difference:
So you’re looking for documents that include the text string “bribery”. What do the results say about bribery? If you only have provenance names, then you’ll have to click and read all 109 of them to find out. But notice how a few of the file names at least give you a sense of what they contain – the seventh actually tells you that it’s referring to a specific incident where the French were accused of bribing foreign princes. That’s actually useful and specific, and you can tell right away if it’s relevant or not. Provenance names are ok for unprocessed image PDFs, but once you start taking notes you should name your note documents with substantive file names. Totally granular!
I haven’t really used keywords within the content, as distinct from the patterns DT’s AI determines. You could add unique text strings, e.g. “Keyword:Siege-Tactics” and find all such keyworded documents with a text string search. But this is a bit harder than it seems, since DT only searches for case-insensitive alphanumeric characters (i.e. the colon and dash in the above example wouldn’t actually be found). I’m also hesitant to adopt this strategy wholesale since I don’t know what exact effect adding such keywords would have on the AI, largely because DT won’t tell us how exactly its AI works.
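Here is a rough Python sketch of why the punctuation matters. It assumes the search index simply lowercases text and keeps runs of letters and digits; that is my guess at the behavior described above, not a documented description of DT's indexer, and the function name is hypothetical.

```python
import re

def searchable_tokens(text):
    """Lowercase the text and keep only runs of letters and digits (assumed behavior)."""
    return re.findall(r"[a-z0-9]+", text.lower())

# The colon and dash vanish, leaving three ordinary words.
marked = searchable_tokens("Keyword:Siege-Tactics")

# A marker with no punctuation survives as a single, distinctive token.
fused = searchable_tokens("KeywordSiegeTactics")

print(marked, fused)
```

Under that assumption, a search for the marked keyword is really a search for three common words, which will match plenty of ordinary prose about sieges. A fused marker would at least stay unique, though I haven't tested that against DT itself, and it does nothing to answer the question of how such strings affect the AI.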
There are a few other features you can use in a limited way. Each file can have a color-coded Label (7 possible choices), which I tend to use for projects (i.e. as a third cross-cutting categorization), or else when I need to parse a document into multiple files. You can also mark files with a flag. But since labels and flags only have a few possible states, their use is limited. I tend to use them for administrative purposes.
Another implication of the AI that deserves mention: create a separate database for each language. It’s far from ideal, but it’s the only way to take advantage of DT’s ability to identify similar documents by their content. The software is language-neutral after all – it doesn’t know that c-h-i-e-n means the same thing as d-o-g. You should also replicate your group (and tag) hierarchies in each database. Not only does a single group hierarchy across databases allow you to compare Classify results across databases, but since you can also run a search across databases (and save cross-database searches as smart groups), you will be searching and limiting your results with the same terms. All the same, I’d still like to have as many documents as possible in a single database, so I put all the non-text PDFs (photos and scans of manuscript documents, old books that can’t be OCRed) in my main English-language database, regardless of their original language. In theory they won’t throw off the AI, although some of them must have some metadata text somewhere because the AI still makes suggestions in the Classify/See Also drawer. When I take notes on the image PDFs, I do it in English (translating quotes as needed) so the text can be combined with all the other English documents.
Next up in the DTPO series: Entering data.