No, seriously, it’s fun

Data entry is so much fun that I’ve spent the past month doing it – mostly transitioning from PC to Mac and from PDFs in folders+Access database to importing everything into Devonthink Pro Office (though I’m still not happy about the lack of Access on the Mac). I’ve been a bit obsessive about the process, so I’ve put off a few other things, this blog included. That’s what summer is about I guess.

It’s still a work in progress, but the results thus far? DTPO’s easy database summary gives some detail on my late obsession: my main DTPO database (English-language only, with almost none of my primary source transcripts from Access) currently measures 32 GB, with 22,000 documents and 300 topical groups and subgroups (think nested keyword folders). Of those 22,000 docs, 7,000 are rich text docs (a few primary sources but mostly OCRed secondary sources) and 14,000 PDFs (600 archival photo collections and contemporary published primary sources, the rest newspaper images). Those 7,000 text documents contain 31 million words (410,000 unique words). Once I figure out how best to import all of my notes from my Access database, that should add another 5.7 million words – right now they’re sitting all bunched up in a separate DTPO database, one giant document for each year. And I still have a few thousand PDFs to import as well, mostly pamphlets and other published accounts from the period (Google Books, Gallica, EEBO/ECCO…).

Seems like a lot of info, and it is. But the point of DTPO is to search through a lot of text. To give an idea of the application’s capabilities and speed (on my 2012-model 27″ iMac with 2.9 GHz processor and 24 GB RAM), I ran a search for the phrase “I do not know” in 67 million words (all seven databases). In 0.234 seconds it found 92 documents containing that text string; in 0.27 seconds a fuzzy search of the same phrase found 94 items. In 0.36 seconds it found 432 documents that had the string “do not” within ten words of the word “know” (aka proximity search); in 0.18 seconds it told me there are 379 documents that have the string “do not” ten words or fewer before the word “know” (proximity search in a specific direction). Its Concordance feature also let me know that the word “know” appears 340 times in those 31 million words. You can combine these text searches with tagging, grouping and metadata: in 0.068 seconds I found the 35 documents that are both tagged as Secondary sources (3130 docs) and include the phrase “I do not know” in the content. Throw in wildcards, Boolean search terms and parenthetical nesting – it’s got it all.
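For anyone wondering what a proximity search actually does, here is a minimal Python sketch of the idea. It is not how DTPO implements it (I have no idea what happens under the hood); the function names, the tokenizer and the way I count "within ten words" are all my own assumptions for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def find_phrase(tokens, phrase_tokens):
    """Return the start indexes where the phrase occurs as consecutive tokens."""
    n = len(phrase_tokens)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase_tokens]


def near(text, phrase, word, distance=10, ordered=False):
    """Return True if `phrase` occurs within `distance` words of `word`.

    Assumption: distance is counted in token positions between the word
    and the nearest edge of the phrase. With ordered=True, the phrase
    must come before the word.
    """
    tokens = tokenize(text)
    phrase_tokens = tokenize(phrase)
    target = word.lower()
    word_positions = [i for i, tok in enumerate(tokens) if tok == target]
    for start in find_phrase(tokens, phrase_tokens):
        end = start + len(phrase_tokens) - 1
        for pos in word_positions:
            if 0 < pos - end <= distance:
                return True  # word follows the phrase
            if not ordered and 0 < start - pos <= distance:
                return True  # word precedes the phrase
    return False
```

With this sketch, `near("You know that we do not agree", "do not", "know")` is true, but the same call with `ordered=True` is false, because "know" comes before "do not" rather than after it. A real tool would work from a prebuilt index instead of re-reading every document, which is presumably where DTPO gets its speed.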

DTPO also includes what its creators refer to as “AI” (artificial intelligence), although it’s really one or more algorithms for determining similarity between documents based solely on their textual content – there’s a related Classify feature that will group similar documents together based on their content. The recipe is a secret, but they say it looks at word frequencies and patterns in the relationships between words within documents. Two examples of this feature can be seen in this screenshot of the search window below:

DTPO Similar Words
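Since the real recipe is secret, the best I can offer is a textbook stand-in for the "word frequencies" part: cosine similarity between word-count vectors. This is a sketch under my own assumptions, not DTPO's algorithm, and the function names are made up for the example.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Count how often each word appears in the text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(target, documents):
    """Rank every other document by its similarity to documents[target]."""
    target_tf = term_frequencies(documents[target])
    scores = [
        (name, cosine_similarity(target_tf, term_frequencies(text)))
        for name, text in documents.items()
        if name != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Two documents that share many of the same words in similar proportions score close to 1; documents with no words in common score 0. Raw counts like these are dominated by common words such as "the", so a serious version would weight or drop them.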

A search for the text string know results in 1148 documents (including any word with ‘know’ in it somewhere: knowledge, unknown…). The bottom part of the main search window shows the first instance in the selected document (unfortunately no KWIC layout here). Click on the Similar Words button and after waiting 3 seconds a ‘drawer’ on the right opens up with more info provided by the AI. The top is a list of other words that are somehow “similar to” the text string “know” – not sure if this is more of a co-occurrence feature (frequently found with, and presumably in close proximity to, the search string), or whether it’s intended more as a synonym finder (which it seems to fail at). The bottom part shows other words that might be the same text string but misspelled – similar to the fuzzy search feature I think, and very useful for OCRed text. [Sidenote: Right now my database consists of entire articles and book chapters; the AI should become more accurate as I eventually parse them into smaller, more meaningful chunks of text.]
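The "might be misspelled" list is easy to imitate, at least roughly. Here is a small sketch using `difflib` from Python's standard library to pull likely OCR variants of a word out of a text. It only illustrates the general idea of fuzzy matching; DTPO's own method is unknown to me, and the function names and the 0.7 cutoff are my assumptions.

```python
import difflib
import re


def vocabulary(text):
    """Return the sorted set of distinct lowercase words in the text."""
    return sorted(set(re.findall(r"[a-z]+", text.lower())))


def likely_variants(term, text, cutoff=0.7, limit=10):
    """List words in `text` that look like misspellings of `term`.

    `cutoff` is difflib's similarity ratio (0 to 1); higher values are
    stricter. The exact value here is a guess, not a tuned setting.
    """
    candidates = [w for w in vocabulary(text) if w != term]
    return difflib.get_close_matches(term, candidates, n=limit, cutoff=cutoff)
```

For example, `likely_variants("know", ocr_text)` would pick up a garbled "kuow" while ignoring unrelated words. Lowering the cutoff catches more mangled spellings at the cost of more false hits.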

In other words, DTPO can take a lot of words and search through them pretty darn quickly, in several different ways.

For those curious about Devonthink Pro Office for historical research, there will be blog posts. But in the meantime, for those who don’t know how to use Google, here are a few historians (and other academics) who’ve blogged about using DT, in an introductory sort of way:

But be ye warned!

First, if you want to consider a program like DT, RTFM! Several times, if you’re seriously considering the product. And don’t throw it away after you’ve started using the program – return to the manual again and again as your knowledge of the program grows.

Second, judge carefully what you read online, especially from bloggers. As historians, we should already know that we need to be really careful about chronology, and this is also true when researching software online. Features change over time, sometimes significantly, and usually in the direction of adding more. This is particularly important for software like DTPO, which has been around for perhaps a decade (I think) and gone through two major versions, adding numerous features in the process. Yet most blog commentary on DT dates from 2011 or earlier, before version 2 was released. But you don’t need to be a historian to know that you need to take claims with a grain of salt: in blog posts and comments people often assert that software X cannot do task Y, even when it can – I’ve seen this half-a-dozen times with DT. Sometimes blogger ignorance is as much to blame as outdated blog posts. So caveat emptor.

More details to come. Needless to say I’m trying to use DTPO as efficiently as possible, though I am hampered a bit by the disconnect between DTPO and relational databases. The shift from the more structured relational database format to a more freeform text database makes importing tens of thousands of records challenging. As a first (temporary) step, I’ve just done a brute-force data dump from my Access database – the text is searchable by the above methods even if the metadata isn’t included.
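One alternative to the giant-document dump would be to export each Access table to CSV and split it into one small text file per record, with the metadata written as plain lines at the top so that it at least stays searchable. A rough Python sketch follows. The column names (`id`, `note`) are hypothetical, the sketch assumes record ids are safe to use as file names, and I haven't tested how well the results import into DTPO.

```python
import csv
from pathlib import Path


def export_records(csv_file, out_dir, id_field="id", text_field="note"):
    """Write one text file per CSV row and return the paths written.

    Every column except `text_field` becomes a "name: value" header line;
    the note text follows after a blank line. Column names are assumptions.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for row in csv.DictReader(csv_file):
        header = [f"{key}: {value}" for key, value in row.items() if key != text_field]
        path = out_dir / f"{row[id_field]}.txt"
        path.write_text("\n".join(header) + "\n\n" + row[text_field] + "\n", encoding="utf-8")
        written.append(path)
    return written
```

Called as `export_records(open("notes.csv", newline="", encoding="utf-8"), "notes_out")`, where both names are placeholders, this would give a folder of small files rather than one document per year.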

Oh yeah, there’ll be future posts on martial music, and that whole Churchill thing.


One response to “No, seriously, it’s fun”

  1. Pamela Toler (@pdtoler) says :

    I’m looking forward to reading your posts on Devonthink. I’m slowly finding my way through the program.
