DEVONthink update

In case anybody wonders whether DTPO can handle large databases, here are the current stats for my main WSS database:

WSS db as of 2015.01.05

Its large size means that if you restart your computer and reopen the database(s) – which I rarely do – it takes several minutes each time. Presumably the size of the database is also why, any time I right-click on text in a record, it takes about 13 seconds for the contextual menu to pop up. FWIW, I also tend to have ten databases open at the same time – the second-largest weighing in at 26 GB and 69 million words. So perhaps my databases are a bit larger than the average bear’s, which is why I upgraded my 256 GB MacBook Air to a 480 GB JetDrive.

To rerun my previous benchmark searches (on my 2012-model 27″ iMac with 2.9 GHz processor and 24 GB RAM):

“I do not know”: Old – 67 million words in the main database and 7 databases open: 92 items in 0.234 seconds. Now – 10 databases totaling 119 million words (main database of 74 million words): 250 items in 0.997 seconds.

“I do not know” fuzzy: 67m words and 7 databases open: 94 items in 0.27 seconds. 10 databases open: 259 items in 0.873 seconds.

Know NEAR “do not”: 67m words and 7 databases open: 432 items in 0.36 seconds. 10 databases open: 564 items in 0.77 seconds.

“do not” BEFORE know: 67m words and 7 databases open: 379 items in 0.18 seconds. 10 databases open: 1172 items in 0.373 seconds.

Concordance: “know” appears 41,610 times.

So searching across 119 million words in 10 databases, each of those searches still takes less than a second. I can live with that.

The database generally works fine and hasn’t crashed in a long time. I sync back and forth between my iMac and MacBook Air all the time, with rarely a hitch. You should, however, be careful not to accidentally delete a tag with any documents within it (and then empty the Trash), unless you want those documents to disappear permanently – I’d guess this is what the developers meant when they warned that documents residing only in tags are more ephemeral than those in groups. I caught the accidental deletion before emptying the Trash, but for some reason I thought that if I synced with an earlier copy of the database on my MacBook Air, sync would restore the deleted documents rather than propagate the deletions to the MBA. Needless to say, I was displeased to find that the sync deleted the tagged documents in my other copy of the database. Since sync didn’t work as I expected, I had to restore a recent backup, and I rebuilt the database for good measure (I probably should have rebuilt it anyway, given its size and how many alterations I make to it). The lesson: pay close attention to your Trash. Of course I assume you always Verify & Repair and Backup & Optimize, don’t you?

The 74 million words are a result not only of early modern English (irregular spelling especially), but also of the fact that I haven’t been nearly separatist enough in segregating the English documents from the French. The AI works pretty well nonetheless, though I think the algorithms can get a bit confused when they focus on rare words like place names, which encourages the AI to make geographical links to other documents, whereas I’m more likely to want thematic linkages (same theater, but one document is on small war and another on a siege). Especially since most of my archives are organized by geography – all of the documents in AG A1 1937 are about the Flanders theater (and take place during 1706), so I hardly need the AI to tell me that. But the AI has definitely been worth it overall.

The notecard-on-an-image-PDF technique has been a true godsend – I use it all the time on both primary and secondary sources, image and text PDFs alike.

I also have a few thousand more PDFs because I’ve started making a subtag in the provenance tags to keep parsed individual pages of each text PDF – you can find an Automator script online to automate that process for you. Originally I created a separate database for those, but I’m not sure that it’s worth it.

Parsed subtag

As mentioned before, this makes it much easier to do a proximity search: you don’t have to wade through dozens of single hits in a multi-page text PDF before reaching the page where the proximity hit is actually located. I also keep a copy of the original, multi-page image PDF in the database, in case I want to quickly skim across multiple pages or look up a specific page (because I’m not going to rename each single-page PDF with its printed page number). Remember, though, that if you parse your text PDFs into single pages, you may lose proximity hits that span more than one page. If that’s a concern, you could keep a copy of the multi-page text PDF as well. In that case you might want to sort your Advanced Search results window by ascending word count, which puts your massive text PDFs at the very bottom and out of the way, so you can focus on your smaller documents and notes first. Then you can decide how much time you want to spend plowing through all the single hits before finding the proximity hit in a 400-page document.

Of course if you had consistent delimiters between your various pieces of correspondence (often stretching across more than one page), you could convert the text PDF to Rich Text and then parse by that delimiter. But that would be far too convenient for the unstructured data that is early modern history.
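If you did have such delimiters, the parsing itself would be trivial. A minimal sketch, assuming each letter in the exported text is separated by a line of five dashes (that delimiter, and the function name, are my own assumptions for illustration):

```python
# Split an exported transcript into individual pieces of correspondence,
# using a consistent delimiter line ("-----" is an assumed convention).
def split_by_delimiter(text: str, delimiter: str = "-----") -> list[str]:
    """Return one string per piece of correspondence, delimiter lines removed."""
    pieces = [p.strip() for p in text.split(delimiter)]
    return [p for p in pieces if p]  # drop empty fragments around delimiters
```

Each returned string could then be saved as its own record, so hits land in the right letter rather than somewhere in a 400-page file.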

So that’s how I roll…

