Two months without a Devonthink post?!?
My Devonthink posts, marginally related to EMEMH proper I admit, continue to attract readers, and even comments/questions. So an update is in order.
To frame the discussion, here’s a reminder of what any note-taking setup should allow you to do:
- Collect: Amass all your documents, notes, thoughts, writings and images into a single interface. The wider the variety of files you can view, the better. The wider the variety of documents you can search with the same keywording scheme, the better.
- Summarize: Distinguish original documents from your notes on those originals. Provide a way to connect summaries to the originals.
- Sort: Organize your data in sortable lists, usually using metadata fields (date, author, place…) as the sorting variable.
- Search: Find specific documents quickly via title, author, and other metadata, or by navigating through a folder hierarchy. Find specific text strings within any documents. More powerful software (like DTPO) will allow proximity and fuzzy searching.
- Cite: Provide a place to store bibliographic information about each source, so that it can be cited when used.
[Reminder: I haven’t fully implemented all of my features, so some of the screenshots may show outdated features. Still a work in progress.]
Thus far the biggest advantage of DTPO for me is a rather mundane one: collecting all my sources (images, texts) in one place, in one program that can display them all, and with a consistent metadata scheme I can use to find specific sources (or categories of sources) quickly. Some statistics illustrate the scope of my needs. My main database, descriptively-entitled ‘WSS’, is a behemoth. For those interested in the details:
I have over 26,000 individual documents, totaling 80 GB of space. As you can see, 15,000 of these are image-only PDFs, which means their content is not searchable, though their file names and metadata are. Nonetheless, merely storing the thousands of PDFs in DTPO is a big improvement over my prior system of nested file folders. I can now also import all my images into DTPO (actually I try to convert them to PDFs to use that file format’s expanded metadata). Thus all my documents on the 1708 siege of Lille are in the same place (in the Lille 1708 siege group, duh), whether they be siege plans, siege accounts, or my notes and questions about the siege and its sources. Previously I had to go to Picasa to search for images, separate from my texts.
I’ve named my groups such that the group ‘Lille1708’ already has the year in its name (so a saved search looking for *1708 will find it), and the Lille1708 group is nested within the ‘SiegesLC’ group, so a search for L(ow)C(ountries) will pull it up, along with records from the ‘BattlesLC’ group, and so on.
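The payoff of this naming convention can be sketched in a few lines of Python. This is not DEVONthink’s actual search engine, just an illustration (with made-up group names) of how wildcard patterns like *1708 match year-suffixed group names, however deeply nested:

```python
from fnmatch import fnmatch

# Hypothetical group names following the convention above: the year (and,
# via nesting, the theater) is baked into the name/path itself.
groups = ["Lille1708", "Tournai1709", "Oudenarde1708", "SiegesLC/Lille1708"]

# A wildcard search like *1708 matches any group whose name ends in the year.
matches_1708 = [g for g in groups if fnmatch(g, "*1708")]
print(matches_1708)  # ['Lille1708', 'Oudenarde1708', 'SiegesLC/Lille1708']
```

Because the year lives in the name, a single saved search pulls in sieges, battles, and anything else from that campaign season without any extra tagging.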
Ideally many of these manuscript PDFs will eventually be transcribed to take advantage of DTPO’s AI – since they are handwritten, OCR won’t work. It is easy enough to transcribe an image file of a manuscript document into a rich text file, with two windows open. It also helps if you have a big monitor.
But all this transcription will take TONS of time, and I need to produce scholarship in the meantime. Therefore I’ve chosen to import everything I can into DTPO as quickly as possible – putting the originals in tags, and keeping my processed, or at least parsed, content for the groups (see my recent caveat though). The content of full-text documents (stored in the tags) is fully searchable, which is a major improvement on its own, and you can also take advantage of the AI’s See Also and Classify features with these documents. You can automate this searching by adding saved searches (“smart groups”) inside any group or tag, for example one that will list all textual documents dated 1708 that include a text string such as “Lille.” ‘One thought, one note’ makes a huge difference here.
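A smart group is, in effect, a stored filter over your documents’ dates and text. Here is a minimal Python sketch of that logic; the record fields and sample data are illustrative stand-ins, not DEVONthink’s actual data model:

```python
# Each note is a small record with a date field (YYYY.MM.DD) and full text.
docs = [
    {"name": "Siege journal", "date": "1708.08.12",
     "text": "The trenches before Lille were opened tonight..."},
    {"name": "Battle letter", "date": "1708.07.11",
     "text": "Oudenarde was fought in the evening..."},
    {"name": "Siege note", "date": "1709.07.15",
     "text": "Tournai invested by the Allies..."},
]

# The smart-group criteria: dated 1708 AND containing the string "Lille".
hits = [d for d in docs
        if d["date"].startswith("1708") and "Lille" in d["text"]]
print([d["name"] for d in hits])  # ['Siege journal']
```

The point of ‘one thought, one note’ is visible here: the filter returns whole documents, so the smaller and more atomic each note, the more precise the smart group’s results.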
For the image-only PDFs, however, I have to rely on the metadata. I generally add enough metadata and tags/grouping in order to narrow down the results I need to wade through. Generally I’ve been parsing each archive PDF (consisting of several hundred archive folios) into separate years as a first step. This simple parsing makes a huge difference since one of my main units of analysis is a combat event, which generally occurs within a specific campaign year. Since most individuals served only in one theater each year, the geographical variable is isolated as well.
As you can see in the above screenshot, I also pasted in the British Library MS online catalog entry as a separate document, shown in the floating pop-up window. This is really useful because it provides reference info on a selection of letters, which will appear if I search for a minor figure like “Ittersum.”
You could add a link in the catalog record doc to jump straight to the image PDF, or else Reveal the document in its containing folder. If you were really bored, you could create a link from each catalog entry to the specific page in the image PDF where that document starts. Generally, though, you can avoid some hyperlinking monotony by simply selecting some text, then right-clicking to run a search based on your selection. You could also turn this into a key command, obviously.
The unique word count in the WSS database is quite high at 801,000: foreign language words in Anglophone scholars’ publications (including titles in the footnotes and bibliographies) account for some of this, but it also results from the irregular spelling of early modern English, as well as dirty OCR. As an early modern Europeanist, you can’t really avoid any of these, which presents its own problems for DTPO.
Overall, DTPO has made a huge difference as far as collecting my documents in a single place.
Technically, the majority of the documents enumerated above are not, in fact, notes. They are copies of the original primary and secondary sources, rather than content I’ve processed in some way, whether through a summary, a paraphrase, an excerpted quote, a keyword, or a comment/thought I’ve added. In one sense the full-text versions are less useful than notes based on them, because I haven’t condensed out the germane information, which leaves a massive amount of text to conceivably wade through. On the other hand, half the point of having the originals (in DTPO or anywhere else) is that you’ll invariably return to the sources with different questions over time, or look for more context for your fragmentary notes.

So too with the image PDFs: in one sense they are even less useful than the full-text originals, since you can’t even search their content; but they are still much more useful than when they were buried eight levels deep in a folder hierarchy. More to the point, notes have their own disadvantages: I’m sure I’m not the only one who’s had occasion to rue the day I paraphrased when I should have quoted, nor the only one to be thankful that I photographed documents in the archives rather than just taking notes. And, let’s be honest, I have copies of a few hundred thousand letters, a hundred-plus treatises, etc. – there’s no way I’ll ever be able to go through every one of them. Welcome to the historical profession.
I’ve tried to mitigate the problems associated with a predominance of originals over notes by:
- Creating a system that will allow me to, over time, take notes on the originals.
- Taking advantage, in the meantime, of full-text search capabilities for OCRed published sources.
- Assigning metadata to progressively narrow down my search for image PDFs, even if the search isn’t as precise as I’d like it to be.
As described in previous posts (check the Devonthink blog tag), I keep the originals in DT tags, and my notes in the DT groups, where the AI works. If there are image PDFs that are specific to a single group, I can put those in the group regardless of their language or whether they even include text or not – non-text PDFs shouldn’t interfere with the AI (though See Also and Classify oddly suggest ‘similar’ items to PDFs that have no words, which you wouldn’t think possible).
In most cases I like to keep a non-OCRed, image PDF of an original source in my WSS database, as well as putting a full-text OCRed version (say, of a French treatise) in a separate French database, with a link from the one database to the other. This way I can keep all of my originals at hand: the full text of the original is available for string searches, while the original scanned images are also available in case of (inevitable) OCR issues.
The overall organization can be summarized thus:
TAGS for each source (nested within genre and source type). Each source tag contains the image PDF of the original, the full text of the original if it is in English, and a catalog document or other summary description of the document. There’s a link from the catalog/summary doc to the original, as well as from original full-text to original scan (in case I need to check something).
GROUPS for substantive thematic content. The Artillery-Field Train group – nested in the Artillery group – can have note documents from multiple sources, linked back to the page of the original PDF as needed. You can include the full-text of originals if you want (in English only), as long as they are limited to that topic. The group also includes docs with my thoughts about the subject, as well as any image PDFs of illustrations, maps, even parsed original PDFs if you don’t have time to transcribe everything.
You can also include an index document of sorts, listing all of the sources available on this topic, with links to those documents. DT’s Scripting forum has code to automate the creation of such an index document, here. This partly duplicates what you can see in the group’s list view, but you can be more selective in your index.
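An index document of this kind is easy to generate programmatically. The forum script does it in AppleScript with DEVONthink item links; the sketch below shows the same idea in Python, with placeholder links and made-up source titles rather than real x-devonthink-item URLs:

```python
# (title, link) pairs for the sources in a topical group; the links here
# are placeholders, not actual DEVONthink item links.
sources = [
    ("Vauban, Traite des sieges", "x-devonthink-item://placeholder-1"),
    ("Deane, Journal of the Campaign", "x-devonthink-item://placeholder-2"),
]

# Build a Markdown index note: a heading plus one linked bullet per source.
index_md = "# Artillery-Field Train: sources\n\n" + "\n".join(
    f"- [{title}]({link})" for title, link in sources
)
print(index_md)
```

As noted above, this partly duplicates the group’s list view, but a generated index is curatable: you decide which sources make the cut and in what order.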
SEPARATE DATABASES for each language (i.e. for the full-text sources), to use DTPO’s AI. If you’re like most humanities scholars, and particularly EMEMHians who study subjects who fought in multinational coalition armies, you rely on lots of multi-language, full-text, published sources: Marlborough’s Letters and Dispatches, for example, has letters in English and French, and I think some Latin and German too. The AI can’t make sense of such a mixed collection, so you need to segregate your text documents into separate language databases – maybe even a (temporary) separate database for multi-language sources like the above-mentioned Letters and Dispatches. This way I can use the AI on the French-language sources just as I do on the English-language sources in the WSS database. Separate databases are awkward, however, when you, for example, summarize part of a document in your native tongue (English, in my case) but include an important quote in the original language. My compromise is to keep as much of the notes in the main database as possible – for multi-language documents this means providing summaries in English, translating key quotes into English (making sure to indicate that it’s my translation, which I try to keep as awkwardly literal as possible to preserve key terms), and then providing a hyperlink to the original, transcribed document in the French database. Not ideal, but I’m hoping some Applescript will automate a lot of this (not the translation part, of course).
Overall, my general structure attempts to take advantage of full-text search across multiple languages, and make it easier to find non-transcribed sources quickly.
Another key purpose of any note-taking system is to make it easy to sort the notes. This is automatically available in the group list view with the use of metadata. You can choose which columns to display, and how you’d like to sort them. You can do the same with search results, which is key, because you often need to sort your search results before you can use them.
I still haven’t settled on the perfect combination of naming conventions and metadata fields, but generally I try to keep the Spotlight Comment field for basic title info, I name the document based on its content when I can, I begin one of the fields with a date YYYY.MM.DD for sorting purposes, and I identify the type of document for granularity in my searches. Document type possibilities, which I store in the Keyword file metadata field, include:
- orig: document is a copy of the original (either full-text or scanned).
- note: document is a note on an original (either summary, paraphrase, or quote).
- thought: document is a thought of mine on the subject or source.
- link: document is a link to the original, usually in a different database, or to a website.
- map: document is a map or plan – it needs to be converted to PDF to use the Keyword metadata field to note this.
- pic: document is a contemporary illustration/artifact.
- draft: document is a draft, either mine or from a friend.
- graphic: document is a modern-made graphic (timeline, diagram, etc.), i.e. not a contemporary image.
- index: document is an index, or linked list to other documents on the topic.
- summary: document is a summary of the source, including citation, bibliographic info…
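The YYYY.MM.DD prefix earns its keep because zero-padded date strings sort chronologically as plain text, so any list view or search result sorted on that field comes out in date order for free. A quick Python sketch (field names and sample records are illustrative):

```python
# Records whose sort field begins with a zero-padded YYYY.MM.DD date,
# plus the document-type Keyword described above.
records = [
    {"date_field": "1708.12.09 Lille citadel capitulates", "keyword": "orig"},
    {"date_field": "1708.08.12 trenches opened at Lille", "keyword": "note"},
    {"date_field": "1708.07.11 Oudenarde fought", "keyword": "thought"},
]

# Plain lexicographic sort on the string field yields chronological order.
for r in sorted(records, key=lambda r: r["date_field"]):
    print(r["date_field"], "|", r["keyword"])
```

This is also why the padding matters: "1708.7.11" would sort after "1708.12.09" as a string, while "1708.07.11" sorts correctly.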
There are various other metadata fields that you can use, but since they all need to be added by hand for each record, I’m not sure yet how useful they are. I have, however, tweaked a number of the DT-provided Applescripts (ok, my programming wife has, while I watched) to do things like copy the file title to the Spotlight Comment, so I can then rename the file with a title that describes the content of the document rather than its provenance. (Remember to include this title summary in the content of your document if you want the AI to use it.) You can also use the columns to sort your search results in the advanced Search window.
A new advance: I finally figured out how to replicate my old Access report, which lists each document on topic X, sorted by date, regardless of language. In DT this requires a smart group (saved search) that searches across multiple databases:
My old Access report, for comparison:
On the one hand, DT’s smart group doesn’t display the content of each document on the same page (i.e. at the same time, for comparison). But on the other hand, descriptively-named documents would be even better in list format, once I name them all. And as you’ve already seen, you can always open each document in a separate pop-up window if you’d like. That’s one more Access feature I’ve been able to replicate in DT.
DT’s search capabilities have already been addressed in a number of contexts: fuzzy search, proximity search, wildcards, Similar Words, See Also and Classify… At this stage DT doesn’t have a lot of flexibility in its metadata, which is often an under-appreciated aspect of note-taking, but it’s adequate for most purposes.
So here I’ll just add some statistics to give you a sense of the speed of DTPO. Speed can have a qualitative impact on your research, especially as you multiply the number of searches you perform. Despite 80 GB and 27,000 documents, DTPO runs fine even on my 4 GB, 1600 MHz MacBook Air, never mind my much more powerful 24 GB RAM iMac desktop. In fact, it’s usually faster for me to simply search for a keyword in the title of a document I’m looking for than to navigate through the nested hierarchies. Searching the 59 million words in the 8,755 text files in the database is relatively quick. To compare with a search (of all the databases on my main desktop) that I did in a previous post:
| Search | Results (WSS) | Time (WSS) | Results (all databases) | Time (all databases) |
|---|---|---|---|---|
| “I do not know” | 92 | 0.234 sec | 129 | 1.722 sec |
| “I do not know” fuzzy | 94 | 0.27 sec | 131 | 1.278 sec |
| “do not” within 10 words of “know” | 432 | 0.36 sec | 509 | 0.09 sec |
| “do not” before “know” | 379 | 0.18 sec | 730 | 0.034 sec |
As an aside, large text files require much more processing power, e.g. opening a 5 MB text file may take a few seconds, whereas opening a 100 MB image file happens almost instantly. So if I had 15,000 rtf’s instead of 8,000, the searches would take a bit longer. Nevertheless, it’s pretty speedy.
As I’ve mentioned before, DTPO doesn’t have anything akin to Zotero’s automated downloading of publication metadata, although there is some Applescript out there that can import Zotero records into DT (I’m not sure whether it still works). In any case, you can add a properly-formatted citation of each source in the catalog/summary doc in each tag, and then quickly find it with the “summary” Keyword field. I still need to figure out how best to import all my Secondary data from my old Access database.
So overall I’m happy with the transition, even though I know there are certain potential projects that won’t be possible with DTPO, particularly measuring correspondence with statistics, or tracking all the details of correspondence with metadata alone. DT’s focus on textual content is clearly optimized for thematic topics. Yet all in all, DTPO makes it much easier for me to find most things than in my previous system, and I can view PDFs, image files (which I try to convert to PDFs), text files, etc. all in the same interface, and with consistent metadata.
It’s a good start, and the possibility of further enhancements via Applescript augurs well for the future. So I guess that’s my New Year’s resolution.