Tag Archive | Devonthink

Automating Newspaper Dates, Old Style (to New Style)

If you’ve been lurking here over the years, you know I have a soft spot for Devonthink, the receptacle into which I throw all my files (text, image, PDF…) related to research and teaching. I’ve been modifying my DTPO workflow a bit over the past week, which I’ll discuss in a future post.

But right now, I’ll provide a little glimpse into my workflow for processing the metadata of the 20,000 newspaper issues (yes, literally 20,000 files) that I’ve downloaded from various online collections over the years: Google Books, but especially Gale’s 17C-18C Burney and Nicholls newspaper collections. I downloaded all those files the old-fashioned way (rather than scraping them), but having all those PDFs in your DTPO database doesn’t mean they’re in the easiest format to use. And maybe you made a minor error – one that is multiplied across the 20,000 times you made it. So buckle up as I describe the process of converting text strings into dates and back again with AppleScript. Consider it a case study of problem-solving through algorithms.

The Problem(s)

I have several problems I need to fix at this point, generally falling under the category of “cleaning” (as they say in the biz) the date metadata. Going forward, most of the following modifications won’t be necessary.

First, going back several years, I stupidly saved each newspaper issue under the first date listed on the issue. No idea why I didn’t realize that the paper actually came out on the last of those dates, but it is what it is.

Screen Shot 2014-03-09 at 7.53.14 PM

London Gazette: published on Dec. 13 or Dec. 17?

Second, those English newspapers use the Old Style calendar, which the English stubbornly clung to until mid-century. But since most of those newspapers were reporting on events that occurred on the Continent, where New Style dates were used, some dates need manipulating.

Automation to the Rescue!

To automate this process (because I’m not going to re-date 20,000 newspaper issues manually), I’ve enlisted my programmer-wife (TM). She doesn’t know the syntax of AppleScript very well, but since she programs in several other languages, most programming languages share the same basic principles, and there’s this Internet thing, she was able to write some scripts that automate most of what I need. So what do I need?

First, for most of the newspapers I need to add several days to the listed date to reflect the actual date of publication – in other words, to convert the first date listed in the London Gazette example above (Dec. 13) into the second date (Dec. 17). So I need to take the existing date, listed as text in the format 1702.01.02, convert it from a text string into an actual date, and then add several days to it. How many days exactly?

Well, that’s the thing about History – it’s messy. Most of these newspapers tended to be published on a regular schedule, but not too regular. So you often had triweekly publications (published three times per week) that might appear in Tuesday-Thursday, Thursday-Saturday, and Saturday-Tuesday editions. But if you do the math, that means the Saturday-Tuesday issue covers a four-day range, whereas the other two issues per week only cover a three-day range. Since this is all about approximation and first-pass cleaning, I’ll just assume all the issues are three-day ranges, since those should be two-thirds of the total number of issues. For the rest, I have derivative code that will tweak those dates as needed, e.g. add one more day to the resulting date if it’s a Saturday-Tuesday issue instead of a T-R or R-S issue. If I were really fancy, I’d figure out how to convert the date to a weekday and tell the code to treat any Tuesday publication date as a four-day range (assuming it knows dates before 1900, which has been an issue with computers in the past – Y2K, anyone?).
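
In AppleScript the check itself would be short. Here’s a minimal sketch, assuming a hypothetical startdate variable holding the issue’s first listed date – and trusting AppleScript’s weekday arithmetic on 18th-century dates, which is exactly the worry raised above:

-- sketch: a Saturday first date marks the long Saturday-Tuesday issue
if weekday of startdate is Saturday then
	set enddate to startdate + (3 * days) -- four-day range
else
	set enddate to startdate + (2 * days) -- normal three-day range
end if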

So the basic task is to take a filename like ‘1702.01.02 Flying Post.pdf’, convert the first part of the string (the ‘1702.01.02’) into a date by defining the first four characters as the year, the 6th and 7th characters as the month, and so on; then add 2 days to the resulting date; and then rename the file with this new date, converted back from a date into a string in the format YYYY.MM.DD. Because I was consistent in that part of my naming convention, the first ten characters will always be the date, and the periods can be used as delimiters if needed. Easy-peasy!

But that’s not all. I also need to convert that date of publication to New Style by adding 11 days (assuming the dates are 1700 or later – before 1700 the OS calendar was only 10 days behind the NS calendar). But I want to keep the original OS publication date as well, for citation purposes. So I replace the old OS date at the front of the filename with the new NS date, append the original date to the end of the filename with an ‘OS’ after it for good measure (and delete the ‘.pdf’), and Bob’s your uncle. In testing it works when you shift from one month to another (e.g. January 27 converts to February 7), and even from year to year. I won’t worry about the occasional leap year (1704, 1708, 1712). Nor will I worry about how some newspapers used Lady Day (March 25) as their year-end, meaning that they went from December 30, 1708 to January 2, 1708, and only caught up to 1709 in late March. Nor does it help that their issue numbers are often wrong.

I’m too lazy to figure out how to make the following AppleScript code format like code in WordPress, but the basics look like this:
--Convert English newspaper Title from OSStartDate to NSEndDate & StartDate OS, +2 for weekday
-- Based very loosely off Add Prefix To Names, created by Christian Grunenberg Sat May 15 2004.
-- Modified by Liz and Jamel Ostwald May 26 2017.
-- Copyright (c) 2004-2014. All rights reserved.
-- Based on (c) 2001 Apple, Inc.

tell application id "DNtp"
	try
		set this_selection to the selection
		if this_selection is {} then error "Please select some contents."

		repeat with this_item in this_selection

			set current_name to the name of this_item
			-- the leading 'YYYY.MM.DD' portion of the filename is the OS first date
			set mydate to text 1 thru ((offset of " " in current_name) - 1) of current_name
			set myname to text 11 thru -5 of current_name -- drop the date and the ".pdf"

			set newdate to the current date
			set the year of newdate to (text 1 thru 4 of mydate) as integer
			set the month of newdate to (text 6 thru 7 of mydate) as integer
			set the day of newdate to (text 9 thru 10 of mydate) as integer

			set enddate to newdate + (2 * days) -- first listed date -> publication date (OS)
			set newdate to newdate + (13 * days) -- 2 days to publication + 11 days OS -> NS
			tell (newdate)
				set daystamp to day
				set monthstamp to (its month as integer)
				set yearstamp to year
			end tell

			-- zero-pad the day and month to two digits
			set daystamp to (text -2 thru -1 of ("0" & daystamp as text))
			set monthstamp to (text -2 thru -1 of ("0" & monthstamp as text))

			set formatdate to yearstamp & "." & monthstamp & "." & daystamp as text

			tell (enddate)
				set daystamp2 to day
				set monthstamp2 to (its month as integer)
				set yearstamp2 to year
			end tell

			set daystamp2 to (text -2 thru -1 of ("0" & daystamp2 as text))
			set monthstamp2 to (text -2 thru -1 of ("0" & monthstamp2 as text))

			set formatenddate to yearstamp2 & "." & monthstamp2 & "." & daystamp2 as text

			set new_item_name to formatdate & myname & " " & formatenddate & " OS"
			set the name of this_item to new_item_name

		end repeat
	on error error_message number error_number
		if the error_number is not -128 then display alert "DEVONthink Pro" message error_message as warning
	end try
end tell
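
One caveat about the hardcoded 13 in that script: it’s 2 days (first listed date to publication date) plus 11 days (OS to NS), which only holds for 1700 and later. A pre-1700 batch would need 12 instead, since the OS calendar was only 10 days behind. A minimal tweak, sketched with a hypothetical ns_gap variable (and ignoring the fact that the gap actually changes at the end of February 1700):

-- sketch: choose the OS-to-NS gap by century instead of hardcoding it
if year of newdate < 1700 then
	set ns_gap to 10 -- OS ran 10 days behind NS before 1700
else
	set ns_gap to 11
end if
set enddate to newdate + (2 * days)
set newdate to newdate + ((2 + ns_gap) * days)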

So once I do all those things, I can use a smart group and sort the Spotlight Comment column to get an accurate sense of the chronological order in which publications discussed events.

This screenshot shows the difference – some of the English newspapers haven’t been converted yet (I’m doing it paper by paper, because the papers were often published on different schedules), but you can see how OS and NS dates were mixed in willy-nilly – compare the fixed Flying Post and Evening Post with the yet-to-be-fixed London Gazette and Daily Courant issues.

DTPO Newspapers redated.png

Of course the reality has to be even more complicated (Because It’s History!), since an English newspaper published on January 1, 1702 OS will publish items from continental newspapers and date those articles in NS – e.g., a 1702.01.01 OS English newspaper will have an article dated 1702.01.05 NS from a Dutch paper. So when I take notes on a newspaper issue, I’ll have to change the leading NS date of the new note to the date on the article byline, so it will sort chronologically where it belongs. But still.

There’s gotta be a better way

In preparation for a new introductory digital history course that I’ll be teaching in the fall, I’ve been trying to think about how to share my decades of accumulated computer wisdom with my students (says the wise sage, stroking his long white beard). Since my personal experience with computers goes back to the 80s – actually, the late 70s, with Oregon Trail on dial-up in the school library – I’m more of a Web 1.0 guy. Other than blogs, I pretty much ignore social media like Facebook and Twitter (not to mention Snapchat, Instagram, Pinterest…), and I try to do most of my computer work on a screen larger than 4″. So I guess that makes me a kind of cyber-troglodyte in 2017. But I think it does allow me a much broader perspective on what computers can and can’t do. One thing I have learned to appreciate, for example, is how many incremental workflow improvements are readily available – shortcuts that don’t require writing Python at the command line.

As a result, I’ll probably start the course with an overview of the variety of ways computers can help us complete our tasks more quickly and easily – which requires understanding the different routes by which we can achieve those efficiencies. After a few minutes of thought (and approval from my “full-stack” computer-programming wife), I came up with this spectrum suggesting the ways in which we can make computers do more of our work for us. Toil, silicon slave, toil!

Computer automation spectrum.png

Automation Spectrum: It’s Only a Model

Undoubtedly others have already expressed this basic idea, but most of the digital humanities/digital history I’ve seen online is focused much more on the extreme right of this spectrum (e.g. the quite useful but slightly intimidating Programming Historian) – which makes sense if you’re trying to distantly read big data across thousands of documents. But I’m not interested in the debate over whether ‘real’ digital humanists need to program, and in any case I’m focused on undergraduate History majors who often have limited computer skills (mobile apps are just too easy). So I’m happy if I can remind students that there is a wide variety of powerful automation features available to people with just a little bit of computer smarts and an Internet connection – things that don’t require learning to speak JavaScript or Python fluently. Call it kaizen if you want. The middle of the automation spectrum, in other words.

So I’ll want my students, for example, to think about low-hanging fruit (efficiency fruit?) – things they can spend five minutes googling to save themselves hours of mindless labor. As an example, I’m embarrassed to admit that it was only when sketching this spectrum that I realized I should try to automate one of the most annoying features of my current note-taking system: the need to clean up hundreds of PDFs downloaded from various databases – Google Books, Gale’s newspaper and book databases, etc. If you spend any time downloading early modern primary sources (or scanning secondary sources), you know that the standard file format continues to be Adobe Acrobat PDF. And if you’ve seen the quality of early modern OCR’d text, you know why having the original page images is a good idea.

But you may want, for example, to delete pages from PDFs that include various copyright text – that text will confuse DTPO’s AI and your searches. I’m sure there are more sophisticated ways of doing this, but the spectrum above should prompt you to wonder whether Adobe Acrobat has some kind of script or macro feature that might speed up deleting such pages from the 1,000s (literally) of PDF documents you’ve downloaded over the years. And, lo and behold, Adobe Acrobat does indeed have an automation feature that allows you to carry out the same PDF manipulation again and again. Once you realize “there’s gotta be a better way!”, you only need to figure out what that feature is called in the application in question. For Adobe Acrobat it used to be called batch processing, but in Adobe Acrobat Pro DC such mass manipulations now fall under the Actions moniker. So google ‘Adobe Acrobat Actions’ and you’ll quickly find websites that let you download various actions people have created – which lets you quickly learn how the feature works and modify existing actions. For example, I made this Acrobat Action to add “ps” (primary source) to the Keywords metadata field of every PDF file in the designated folder:

Screenshot 2017-05-10 18.52.17.png

I’d already copied and tweaked macros and AppleScripts that add Keywords to rich text files in my Devonthink database, but this Adobe solution is ideal after I’ve downloaded hundreds of PDFs from, say, a newspaper database.

Similarly, this next Action will delete the last page of every PDF in the designated folder. (I just hardcoded it to delete page 4, because I know newspaper X always has 4 pages – I can sort by file size to locate any outliers – and the last page is always the copyright page with the nasty text I want to delete. I can change the exact page number for each newspaper series, though there’s probably a way to make this a variable that the user specifies with each use.)

Screenshot 2017-05-10 18.52.43.png

Computers usually offer multiple ways to do any specific task. For us non-programmers, the internet is full of communities of nerds who explain how to automate all sorts of software tasks – forums (fora?) are truly a godsend. But it first requires us to expect more from our computers and our software. For any given piece of software, RTFM (as they say), and then check out the software’s website forum – you’ll be amazed at the stuff you find. Hopefully all that time you save through automation won’t be spent obsessively reading the forum!

Devonthink update

In case anybody wonders whether DTPO can handle large databases, here are the current stats for my main WSS database:

WSS db as of 2015.01.05


Devonthink Usage Scenario – Notecarding an image PDF

My post ideas are usually extremely long and involved, which means I have a few dozen drafts that aren’t finished. So I’ll take a different tack for DT and just write a series of short-ish posts on how I’m using DT now, showing a variety of usage scenarios with screenshots. 1100 words isn’t particularly short for a blog post, but it’s my blog.

Unfortunately nobody that I know of has come up with a typology of the types of notes one might take, beyond the bare bones. So I’m calling this one the RTF-notecard-from-specific-page-of-image-PDF technique. Not quite ‘flying crane’, but I lack the Buddhist monks’ combination of wisdom and careful observation of the natural world. This post largely explains the process that replaces what I described in an earlier post, with thanks to korm on the DT support forum for the AppleScript, which I then tweaked.

Say you’ve got a PDF of a primary source without a text layer (OCR doesn’t work) in DT. It could be a scanned volume of archival documents, or an old book.

1. I open the PDF in a separate window, move and resize the window to fill more than half the screen, and zoom in to a comfortable reading level.

2. Start reading.

3. When I come across something that is worth taking note of, I take note of it. Specifically, I select the page: either Cmd-A with the focus in the page (not the thumbnails), or just drag across the page. You don’t actually need to select any text per se, which helps because there isn’t any text in an image-only PDF.

4. Then I invoke the New RTF from Selected Text with URL to PDF macro (Ctl-Opt-Cmd-P for me), as discussed in the aforementioned post. This prompts you to title the new document.

Name it and you have power over it

I overwrite the default (the name of the original PDF file), and instead use a substantive title – like an executive summary of the point being made, e.g. Tutchin says the French are morons. This popup window is really helpful because it forces you to make a summary. Remember that efficient note-taking requires a brief summary, which saves you from having to reread the same quote (possibly several sentences or even a paragraph) every time you need to figure out what it says. One of the most useful payoffs is that naming your files by summary makes it much easier to plow through Search results when you’re performing a needle-in-a-haystack search.

Impact of Naming conventions on Search results (from a previous post)

In needle-in-a-haystack searches most notes aren’t what you’re looking for, so you need a quick way to discard false hits. In many other instances you’re looking for a specific variation on a theme, so you need a quick way to distinguish similar items. Thus a summary title allows you to quickly see that a specific note isn’t on the right topic; it likewise allows you to quickly find a certain variation on the general theme of French stupidity, for example. Having sortable columns in the search results would also facilitate this.

5. After I’ve named the RTF note and hit Enter, I’m prompted to send it to a particular group.

Select group (not tag)

For speed I usually just default to the Inbox by pressing Return, and then use Auto-Classify to help me process the notes (in the Inbox) in a single session. You could instead find the proper group (not a tag, however), and that will be the default group from then on. Usually, though, the same PS will address different topics, which would require navigating my 1000s of groups in that tiny little window. So I go for speed at this phase.

Then the code does more magic. It adds a link from the original PDF to the new RTF note (in the URL field, which is the blue link at the top of the RTF). This allows you to jump back to the original whenever you want. The code also copies the title of the PDF file to the Spotlight Comments of the new RTF (Bonus material: I use the Spotlight Comments as another place to put the provenance info – that way, if I ever need to cite a specific file, I can just select the record in DT’s list pane, Tab to the Spotlight Comments field, Copy the already-selected text and paste it elsewhere). The code also opens the new RTF in its own window (which you may need to relocate/resize) and pastes the file name into the content of the RTF file. I do that last step because the AI only works on alphanumeric characters within the file, not the file name or other metadata.
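
For the curious, the heart of that magic is just a few record properties in DEVONthink’s AppleScript dictionary. A rough sketch of the linking and comment steps only – not korm’s actual script – where thePDF, thePageNumber, and theTitle are assumed to be set already:

tell application id "DNtp"
	-- create the empty RTF note in the current group (sketch only)
	set theNote to create record with {name:theTitle, type:rtf} in current group
	-- URL field: a page-specific x-devonthink-item link back to the PDF
	-- (page numbering in these links is zero-based, if memory serves)
	set the URL of theNote to (reference URL of thePDF) & "?page=" & (thePageNumber as string)
	-- Spotlight Comments: carry over the PDF's title for provenance
	set the comment of theNote to (name of thePDF)
end tell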

Original and notes side by side

6. Now the blinking cursor is in the RTF, with the original image visible, just waiting for your input. You can make further notes and comments, or transcribe however much of the PS you desire.

7. Then add additional tags or groups in the Tag bar of the RTF (Ctl-Tab from the content pane). You can also run Auto-Classify (the magic hat) if you want to move it to a different group, or to get other suggested groups that you then manually enter. (Remember that Auto-Classify moves the record to a different group, so don’t use it if you’ve already gone to the trouble of selecting a group in step 5.)

8. When you’re all done with this single notecard, close it. Now you’re back to the original PDF where you left off. Continue your reading and repeat the process to your heart’s content.

9. If you send all your RTF notes to the Inbox, you’ll need, at some point, to go to the Inbox and assign the notecard RTFs to groups, either with Auto-Classify or by assigning your own tags. If you manually add tags to files in the Inbox, their file names will turn red (indicating there are aliases – aliasi? – in several groups). You’ll then need to get them out of the Inbox (reduce clutter!) by dragging them to the Untagged group you’ve already created, then running the Remove Tags from Selection macro on the selected Untagged files.

Untagged

All this may sound complicated at first, but it becomes second nature once you’ve done it a few times, and once you understand how Devonthink works in general. The busy work of opening and tagging and such only takes a few seconds per note – certainly no slower than writing a physical notecard.

Taking notes in Devonthink

A short post, as I have several research projects to finish up before school starts in two weeks.

With help from some code on the DT forum (and my programming wife), I finally managed to come up with a smooth workflow for taking notes. I have literally 1000s of PDFs that I need to take notes on – a quote here, a paragraph there, my disapproval noted elsewhere. DT comes with an Annotation script that will create a new document (linked back to the original) that you can then take notes in. I don’t use it because (as far as I can tell) you can only have one Annotation document per PDF. Since I am a member of the Cult of ‘One Thought, One Note’, that won’t work for me.

So as I would come across a salient point in a PDF, I’d do the following:

  1. Copy Page Link for the page of interest
  2. Create a new RTF
  3. Name the file with a summary of the point being made
  4. Tab to the Spotlight Comment and type/paste the citation info (even though I still use tags for provenance info, I always include the cite info in the comments)
  5. Jump to the body of the RTF to type ‘###’
  6. Select this ### string
  7. Add a Link from that ### back to the original PDF page. It’s always good to have original (co)ntext at hand.
  8. Then start typing my notes.

Needless to say, this takes many steps – I made it a bit shorter with macros, but not short enough.

The deceptive nature of note-taking

In any note-taking system, having lots of categories (or “fields” in database parlance) in which to record information about sources (aka metadata) is important. Really important. For historians, simple tags aren’t enough; nor is relying on the full text alone; nor is a single hierarchical organizational scheme. What on earth do I mean? Keep reading.

Let’s say you’re looking at the question of how early moderns perceived military deception, trickery, subterfuge, lying, and the like. You find a bunch of germane sources and take notes on their discussions of the subject. To simplify matters, we’ll stay theoretical, focusing on how contemporaries expressed their preferences for either deception or straightforwardness in war, rather than using other measures (the frequency of their actual reliance on such sly stratagems, etc.). Perhaps you try to get a good balance of sources, including theoretical discussions of the use of stratagems as well as reactions to specific examples of deception on campaign.

Deception

Deception, thy name is Megatron

So you’ve found dozens or hundreds of quotes from contemporaries – verily, the digital age has given early modern historians (military ones at least) a bounty of evidence to sift through! You have plenty of notes on other topics as well, so you need to identify those specific to deception for future reference. You could take the shortest route and simply tag them all with a “deception” tag. Assuming you want to examine whether contemporaries considered deception laudable or execrable, you could make your life easier and further sort your quotes into two bins based on the stance presented in each source: those quotes that support the use of deception (pro-deception) and those that are against it (anti-deception). Or maybe there are three possible values for this variable: pro, con, and ambivalent/neutral. Heck, maybe add in a separate value for those sources you checked that you’d expect to discuss the utility of deception but actually don’t – perhaps that’s noteworthy. So we have one Stance category with four possible values: pro, con, neutral/ambivalent, or not discussed. Call the Stance variable a group, a tag, a field, a category, or what have you. Note-takers don’t appear to have a standardized vocabulary for such things.

Thus tagged, you can now easily find those hundreds of examples in just about any note-taking system worth its salt. But historians, at least systematic ones, want to look for patterns within that data. A whole range of questions starts to bubble up. Maybe you want to see if particular types of people shared the same view on the subject? Or whether geography or chronology (or both) played a role in shaping opinions on the morality of deception? Maybe you wonder whether these contemporary prescriptions and proscriptions were universal, or whether contemporary judgments depended on the situation? Maybe you speculate that there are finer gradations within the pro-deception camp? Figuring out all of these questions requires slicing the data in a number of different ways. To do this, you need additional categories to differentiate your notes on deception.

  1. Let’s say you’re looking at the question primarily from the perspective of two groups, say, the English and the French (see how we’re staying totally hypothetical here?). So maybe you want to see if the English were more likely than the French to express their opposition to the use of trickery. (If you wanted to count up the quotes, you could even get all statistical, in a cross tab-y sort of way: nominal variable x nominal variable.) So we should add another variable, call it Side, to the Stance measure. That’s two variables to keep track of for each quote.
  2. But “Side” isn’t really specific enough. What we probably mean by “Side” is actually the side of the author of the quote. This assumes, of course, that the author’s point of view is represented by the quote, i.e. the quote isn’t part of a dialogue where one or both characters might not even represent the author’s position, or the author isn’t playing devil’s advocate in the text, or presenting the opposition’s case before rebutting it… Add a SideOfAuthor category.
  3. And if we have a SideOfAuthor field, we probably also need to note when a French author is talking about the French versus when he might be talking about the English (maybe he’s even talking about a specific Englishman). Perhaps authors saw a difference between the two worth noting? Make it three variables (Stance, SideOfAuthor, SideOfSubject) for our note-taking system, or four for greater precision (Stance, SideOfAuthor, SideOfSubject, SubjectPerson).
  4. But maybe you perceive that the English seem more likely to talk about deception when the French are doing the deceiving – now we’re combining the previous categories. Cynical by nature, you wonder if the French do the same – seeing the deceptive dustmote in the English eye while ignoring the lying log in their own. Is it generally true that each side tends to downplay its own deceptive qualities and highlight those of its enemies, or is there a shared preference for (or against) deception in war? You could figure it out by looking for all the cases where a French author discusses a French subject, and where an English author discusses an English subject. To be certain, you’d need to examine all four cells of what I like to call “the box.”
    The “Box”

    To quickly sort all those hundreds of quotes into the four cells, you need those categories. In case you want to look at more than two sides – throw in the Dutch for good measure – you’ll convert your SideOfAuthor variable from a binary one (possibilities being French or English) to one allowing three or more options. If you want to make it a bit easier, you could add another binary variable, “Self”, which includes possible values of “self”, i.e. the author’s own side, and “other,” i.e. not the author’s side. That way, if you want to query whether authors were self-serving or not, you can simply use the Self variable, rather than test all the pairs of SideOfSubject and SideOfAuthor (SoS=French and SoA=French, SoS=English and SoA=English, SoS=Dutch and SoA=Dutch…).
    But just because you’ve aggregated upward, don’t throw out the more specific SideOfSubject variable. Keep it so you can, for example, group together all the non-English authors’ views of the English if you were looking at more than two countries – to find out if everybody else is ganging up on the poor ol’ English (SoS=English and SoA=not English). Or maybe you’re interested in whether one nationality is more likely than another to talk about itself (or about the Other).

  5. We might want to add yet another variable, what I vestigially refer to as EventID. This could be composed of another generic-specific variable pair: a specific combat field (the specific variable) and a related field for combat type (a generic variable based on the specific combat). There should be a field indicating the type of combat (deception in a surprise attempt, deception in a battle, deception in a siege…). Is deception more acceptable if it leads to a battle than if it leads to a siege? The specific variable in the pair would be a specific combat (the battle of Ramillies, the siege of Lille, the surprisal of Ghent…). Keeping paired general-specific variables gives you flexibility and scalability.
    Other plausible variables could be derived from this Event category. Perhaps contemporaries were pure functionalists: they praised (actual) cases of deception – regardless of who deceived whom – when they succeeded, and excoriated liars only when they were caught out? Or maybe there was a gradual shift in how a specific example of deception was viewed – maybe the French surprisal of Ghent in 1708 was declared base treachery when the English first learned of it, but within a few years the English had gained perspective and come to appreciate the cunning French trick? Perhaps you could examine those curious cases where you know deception was used at a specific event yet its use was not mentioned in a particular source (Stance=not discussed)…

So now we have a good setup, needing at least five different variables to answer our constellation of questions surrounding early modern views on military deception: Stance, SideOfAuthor, SideOfSubject, Self, and Event. Each quote needs the answers to these five questions. And that doesn’t include all the other variables associated with these five categories. To give just one example, each author has various attributes that might be germane in addition to his Side: maybe opinions on the utility of deception vary by one’s generational status, by an author’s military experience, by the theaters he fought in, or by the other authors he read… Lots of metadata to keep track of.
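
If it helps to see the shape of all this, here is one quote’s metadata mocked up as a plain AppleScript record – illustrative field names only, nothing DTPO-specific – including how the Self value can be derived from the author/subject pair rather than entered by hand:

-- one note's worth of metadata, as a record (hypothetical values)
set quoteNote to {stance:"con", sideOfAuthor:"English", sideOfSubject:"French", eventSpecific:"surprisal of Ghent, 1708", eventType:"surprise"}
-- derive Self from the author/subject pair instead of re-tagging
if (sideOfAuthor of quoteNote) is (sideOfSubject of quoteNote) then
	set selfValue to "self"
else
	set selfValue to "other"
end if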

Is this note-taking scenario over the top? Not at all. Not if historians want to base their arguments on more than a random or superficial glance at the sources. Not if historians want to avoid magically finding only evidence that confirms their preconceptions. And not if historians want to make use of the range of sources now easily accessible. In fact, the above example is even simpler than the reality. We could start by simply adding further possible values to existing variables. For the Self category, it would probably be important to appreciate that an English author could be talking about an Englishman (which may need breaking up further into Tory vs. Whig, or Court vs. Country, as well as noting which the author himself is), or this English author could be referring to an enemy (the hated French, or maybe the despised-yet-want-to-be-trading-partners-with-them-all-the-same Spanish, or the problematic Protestant Hungarians rebelling against England’s Catholic Austrian ally), or perhaps the English author was actually referring to an ally (Prince Eugene), or even discussing a neutral (Charles XII of Sweden in his concurrent Great Northern War, or maybe the less-sympathetic Terrible Turks), or maybe it’s a historical reference to Caesar’s use of deception. So at the least, possible Self values could include: own side, enemy, ally, neutral, historical. I can certainly imagine a scenario where an author would opportunistically treat his friends and allies more gently than his enemies, but his friends more kindly even than his allies (or certain kinds of allies, maybe the Protestant ones…).

Other structural additions might be needed: some of the variables might have records that require multiple values. Maybe it’s noteworthy that whenever English authors discuss their own deceptive practices, they always introduce French subterfuge to muddy the issue? In that case SideOfSubject might need both French and English values for the same quote, or at least a “more than one” value. Does your note-taking system allow this? I hope so. You could plausibly keep throwing in additional variables – the list goes on and on.

Too detailed? Not really. Too specialized to be useful for other topics? Hardly. These categories are not unique to the question of military deception. These categories are, in fact, inherent in just about every possible topic somebody might be discussing: Author A, member of group B (SideOfAuthor), said C (Stance) about person/group D’s behavior (SideOfSubject) regarding topic E (in this case, use of military deception). Whenever more than one type of people (or more than one person) says something about somebody else (or themselves), either in general or relating to a specific instance, and has an opinion on it, you should be tracking these details. You’ll never know if your sources are being self-serving or not without this information. You’ll never know how widespread a particular opinion was without this information. You’ll never know which possible patterns help explain the phenomenon under study without this information. In short, you’ll never really know.

How does this intersect with note-taking? You could mentally assign values to each of these variables every time you read the quote, but note-taking is about summarizing the quote in numerous ways so you don’t need to reprocess it every time. You need categories – ways to organize any single quote, or any list of quotes, by any (or all) of these variables. So you might as well spend a few minutes at the start (after you’ve read some sources) figuring out which variables are worth tracking and what their possible values might be (leave room for future changes), and then track them for each record. That’s the note-taker’s way.

Next post: how this all relates to Devonthink. I think.

Automatically parse your inventories

Historians owe a debt of gratitude to those turn-of-the-century archivists, whose nationalistic yearnings led to the creation of dozens of volumes of archive inventories and catalogs. If you do much work on French military history, you likely know the Inventaire sommaire des archives historiques: archives de guerre. This multi-volume inventory provides short summaries of each volume in the A1 correspondence series – more than 3500 volumes up to the year 1722. That’s a lot to keep track of, as I estimated a while back. So much, in fact, that you’ll likely be going back to that particular well again and again. If so, it might be worth your while to include those details in your note-taking system. Here’s how I did it in DTPO.

The first step for any digitization process is to scan the printed pages and convert the page images into text. In ye olden days you had to do it yourself, and then run the result through OCR software. Nowadays it’s more likely that you can download an .epub version from Google Books and convert it to .txt with the free Calibre software. Worst case, download the PDF version and OCR it yourself.

AG A1 inventaire pdf

Now you find yourself with a text document full of historical minutiae. Import the text (and the PDF, just to be safe) into DTPO. Next, add some delimiters to indicate where to separate the one big file into many little files. But do it smart, with automation. Open the text document in Word (right-click in DTPO or find it in the Finder), and then start your mass find-replace iterations to add delimiters, assuming there’s a pattern you can use to add delimiters between each volume. Maybe each volume is separated by two paragraph marks in a row, in which case you would add a delimiter like ##### at the end of ^p^p. You’ll end up with something like this:

AG A1 inventaire txt delimited

As you can see, the results are a bit on the dirty side – I’ll see if I can get a student worker to clean up the volume numbers since they’re kinda important, but the main text is good enough to yield search results.

Once you’ve saved the doc in Word and returned to DTPO, you can use the DT forum’s Explode with Delimiter script. Check the resulting hundreds of records – if there are more than a few errors, erase the newly created documents, fix the problematic delimiters in the original, and reparse. You’ll want to search not only for false positives, i.e. delimiters added where they shouldn’t have been, but also for false negatives, volumes that should have delimiters but were missed. For example, search for ^p^# in Word to check for any new paragraphs starting with a number (assuming the inventory starts each paragraph with the volume number).
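
(If you’d rather not round-trip through Word at all, the same delimiter insertion takes only a few lines of AppleScript. A sketch, assuming Unix linefeeds in the text file and a hypothetical file path:)

-- read the OCR'd text, then mark every double line break with #####
set src to read POSIX file "/path/to/inventaire.txt" as «class utf8»
set AppleScript's text item delimiters to (linefeed & linefeed)
set chunks to text items of src
set AppleScript's text item delimiters to (linefeed & linefeed & "#####")
set delimited to chunks as text
set AppleScript's text item delimiters to ""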

But wait, there’s more. Once you’ve parsed those, you can take it a step further. The summaries aren’t usually at the document level, but there is enough detail that it’s worth parsing the descriptions within each volume. After converting all the parsed txt files to rtf files, move each volume’s document to the appropriate provenance tag/group, and then run another parse on that volume’s record, with the semicolon as delimiter. In the case above, you might also want to parse by the French open quotation mark, or find-replace the « with ;«. Parsing this volume summary gives you a separate record for each topic within the volume, or at least most of them. With all these new parsed records still selected, convert them to rtf and add the provenance info to the Spotlight Comments. Now you’re ready to assign each parsed topic document to whichever topical groups you want.

AG A1 inventaire DTPO


It’s not perfect, but it’s pretty darn good considering how little effort it requires; an hour or so gets you 500+ volume inventories in separate records. Now you’ve got all those proper nouns in short little documents, ready to search, (auto-)group, and sort.

Worth a(nother) look

Yes, I know I spend way too much time thinking about note-taking. What of it?

While reading some online discussions of software-based textual analysis, I came across a link to this excellent article summarizing the weaknesses of full-text search: Jeffrey Beall, “The Weaknesses of Full-Text Searching,” The Journal of Academic Librarianship 34, no. 5 (September 2008): 438-444. Abstract:

This paper provides a theoretical critique of the deficiencies of full-text searching in academic library databases. Because full-text searching relies on matching words in a search query with words in online resources, it is an inefficient method of finding information in a database. This matching fails to retrieve synonyms, and it also retrieves unwanted homonyms. Numerous other problems also make full-text searching an ineffective information retrieval tool. Academic libraries purchase and subscribe to numerous proprietary databases, many of which rely on full-text searching for access and discovery. An understanding of the weaknesses of full-text searching is needed to evaluate the search and discovery capabilities of academic library databases.

If you ever need to explain to your students why keywords and subject headings and indices (indexes) are useful tools, this article is a good place to start.

Full-text search is certainly better than nothing – particularly if you can use fuzzy searching, wildcards, and proximity – but I sometimes wonder if a keyword-only database (a digital index) would still be more helpful than a full-text database, everything else being equal.

Repeat after me: full-text searching must be combined with metadata in order to search subsets and sort results.

Keyboard Maestro + Devonthink

A reader request prompts me to post screenshots of some of the Keyboard Maestro macros I’ve developed to speed up data entry. See my previous post and its comments for a description of the various macros.

Quick Tech Tip: The Power of Placeholders and Cobbling Things Together on the Cheap

So you just received a whole box full of newish French books on warfare in the age of Louis XIV, but don’t have time to read through them all now, much less copy them for search purposes? Just scan and OCR the Table of Contents and Index of each book into your digitized note-taking system of choice (mine still being DTPO, thank you very much). These will serve as placeholders (a virtual index, if you will) whose keywords will show up in your search-for-a-string results, leading you to the bookshelf and the relevant pages. For even better utility, add some keywords (or file them in a topical group) for metadata-powered filtering and sorting.

If you have a little more time, use a batch find/replace (e.g. in MS Word, which you can open DT documents in) to automatically split each Index entry into its own record. For example, if, after you’ve OCRed a book’s Index into a text document, each entry ends with a page number, a period, and a paragraph mark before the next line begins, just search (in Word) for “.^p” and replace that text string with a unique marker, e.g. replace “.^p” with “.#####^p”. Save and close the Word document, go back to DT, then run your Explode by Delimiter AppleScript (from the Devonthink script forum – you can also find similar code for Word VBA online). Enter the delimiter ##### and you’ll get hundreds of records on individual topics that can then be sent (automatically via Auto-Classify) to an appropriate topic group. With such automation it’s always a good idea to first skim through the results to make sure there weren’t any errors or snafus. If there are many, delete the resulting files, fix those delimiters in the original doc in Word, and repeat the exploding. Note: with DT you may need to convert the resulting documents from .txt to .rtf – you can do it in a single batch, though, right after the parsing process while they’re all still selected. And if you keep your provenance data in the Spotlight Comments, open the Info window and type the source info in while you still have all those documents selected.

LD Word replace
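
(The forum’s Explode script does the real work, but the core idea is simple enough to sketch: split on the marker, then create one record per chunk. The property names below are from DTPO’s AppleScript dictionary as I remember them, so treat this as a sketch rather than the forum script itself:)

tell application id "DNtp"
	set theDoc to item 1 of (the selection) -- the imported index document
	set bigText to plain text of theDoc
	set AppleScript's text item delimiters to "#####"
	set entryList to text items of bigText
	set AppleScript's text item delimiters to ""
	set n to 0
	repeat with e in entryList
		set n to n + 1
		-- one small record per index entry; numbered names here, though
		-- a summary-style name is better in practice
		create record with {name:("index entry " & n), type:txt, plain text:(e as text)} in current group
	end repeat
end tell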

This takes several steps and is not particularly elegant, but if you’re dealing with hundreds or thousands of records and don’t want to take the time to learn regex or AppleScript or Python or the next programming-language-of-the-month, it will be well worth your while. I just used this process to import and parse 25,000 records from my old Access database into DTPO, as well as to parse 1600 letters from Marlborough’s Letters & Dispatches that I hadn’t yet entered individually into Access. As “one thought, one note” cultists already know, small chunks make searching and processing much, much easier. And it makes a huge difference when using DTPO’s proximity search. Digitize, man!

The_more_you_know_banner