My post ideas are usually extremely long and involved, which means I have a few dozen drafts that aren’t finished. So I’ll take a different tack for DT and just include a series of short-ish posts on how I’m using DT now, showing a variety of usage scenarios with screenshots. 1100 words isn’t particularly short for a blog post, but it’s my blog.
Unfortunately nobody that I know of has come up with a typology of the types of notes one might take, beyond the barebones. So I’m calling this one the RTF-notecard-from-specific-page-of-image-PDF technique. Not quite ‘flying crane’, but I lack the Buddhist monks’ combination of wisdom and careful observation of the natural world. This post largely explains the process that replaces what I described in an earlier post, with thanks to korm on the DT support forum for the AppleScript, which I then tweaked.
Say you’ve got a PDF of a primary source without text (OCR doesn’t work) in DT. It could be a scanned volume of archival documents, could be an old book.
1. I open the PDF in a separate window, move and resize the window to fill more than half the screen, and zoom in to a comfortable reading level.
2. Start reading.
3. When I come across something that is worth taking note of, I take note of it. Specifically, I select the page: either Cmd-A, with the focus in the page (not the thumbnails), or just drag across the page. You don’t need to actually select any text per se, which helps because there isn’t any text in an image-only PDF.
4. Then I invoke the New RTF from Selected Text with URL to PDF macro (Ctl-Opt-Cmd-P for me), as discussed in the aforementioned post. This prompts you to title the new document.
I overwrite the default (the name of the original PDF file), and instead use a substantive title, like an executive summary of the point being made, e.g. Tutchin says the French are morons. This popup window is really helpful because it forces you to make a summary. Remember that efficient note taking requires a brief summary, which relieves you from having to reread the same quote (possibly several sentences or even a paragraph) every time you need to figure out what it says. Naming your files by summary, for example, makes it much easier to plow through Search results when you’re performing a needle-in-a-haystack search.
In needle-in-a-haystack searches most notes aren’t what you’re looking for – you need a quick way to discard false hits. In many other instances you’re looking for a specific variation on a theme – you need a quick way to distinguish similar items. Thus, a summary title allows you to quickly see that a specific note isn’t on the right topic; it similarly allows you to quickly find a certain variation on the general theme of French stupidity, for example. Having columns to sort the search results by would also facilitate this.
5. After I’ve named the RTF note and hit Enter, I’m prompted to send it to a particular group.
For the purposes of speed I usually just default to the Inbox by pressing Return, and then use Auto-Classify to help me process them (in the Inbox) in a single session. But you could, if you want, find the proper group (not tag, however), and then that will be the default group from then on. Usually, though, the same PS (primary source) will be addressing different topics, which would require navigating my 1000s of groups in that tiny little window. So I go for speed at this phase.
Then the code does more magic. It adds a link from the original PDF to the new RTF note (in the URL field, which is the blue link at the top of the RTF). This allows you to jump back to the original whenever you want. The code also copies the title of the PDF file to the Spotlight Comments field of the new RTF (Bonus material: I use the Spotlight Comments as another place to put the provenance info – that way if I ever need to cite a specific file, I can just select the record in DT’s list pane, Tab to the Spotlight Comments field, Copy the already-selected text and then paste it elsewhere). The code also opens up the new RTF in its own window (which you may need to relocate/resize), and pastes the file name into the content of the RTF file. I do that last step because the AI only works on alphanumeric characters within the file, not the file name or other metadata.
6. Now the blinking cursor is in the RTF, with the original image visible, just waiting for your input. You can make further notes and comments, or transcribe however much of the PS you desire.
7. Then you add additional tags or groups in the Tag bar of the RTF (Ctl-Tab from the content pane). You can also run Auto-Classify (the magic hat) if you want to move it to a different group, or have other suggested groups that you then manually enter in. (Remember that Auto-Classify moves the record to a different group, so don’t use it if you’ve gone to the trouble of already selecting a group in step 5).
8. When you’re all done with this single notecard, close it. Now you’re back to the original PDF where you left off. Continue your reading and repeat the process to your heart’s content.
9. If you send all your RTF notes to the Inbox, you’ll need, at some point, to go to the Inbox and assign the notecard RTFs to groups, either with Auto-Classify or by assigning your own tags. If you manually add tags to files in the Inbox, their file names will turn red (indicating there are aliases – aliasi? – in several groups). You’ll then need to get them out of the Inbox (reduce clutter) by dragging them to the Untagged group you’ve already created, then run the Remove Tags from Selection macro on the selected Untagged files.
All this may sound complicated at first, but it becomes second nature once you’ve done it a few times, and once you understand how Devonthink works in general. The busy work of opening and tagging and such only takes a few seconds per note – certainly no slower than writing a physical notecard.
Short post, as I have several research projects to finish up before school starts in two weeks.
With help from some code on the DT forum (and my programming wife), I finally managed to come up with a smooth workflow for taking notes. I have literally 1000s of PDFs that I need to take notes on – a quote here, a paragraph there, my disapproval noted elsewhere. DT comes with an Annotation script that will create a new document (linked back to the original) that you can then take notes in. I don’t use it because (as far as I can tell) you can only have one Annotation document for each PDF. Since I am a member of the Cult of The One (One Thought, One Note), that won’t work for me.
So as I would come across a salient point in a PDF, I’d do the following:
- Copy Page Link for the page of interest
- Create a new RTF
- Name the file with a summary of the point being made
- Tab to the Spotlight Comment and type/paste the citation info (even though I still use tags for provenance info, I always include the cite info in the comments)
- Jump to the body of the RTF to type ‘###’
- Select this ### string
- Add a Link from that ### back to the original PDF page. It’s always good to have original (co)ntext at hand.
- Then start typing my notes.
Needless to say, this takes many steps – I made it a bit shorter with macros, but not short enough.
In several previous posts I’ve expressed concern about the pressures encouraging historians to soft-pedal or even minimize the evidence supporting their claims in their published works. In our current age of decreasing attention spans and declining academic book sales, there is growing pressure to eliminate the footnotes (or at least relegate them to annoying endnotes) and shorten the lengths of scholarly tomes by eliminating discussion of historiography and cited evidence. Some historians even counsel us to cull the evidence for our claims to a single example. What’s a historian to do?
Do we really need more than one example?
After a recent post, I received an email from a blog reader pointing me to an online project focused on presenting information on WW2 in a digital environment: Envisioning History. If you’re interested in the potential for digital history, you should check out some of its YouTube videos.
Watching a few of the videos made me appreciate once again how different note-taking needs can be from one academic inquiry to another. For historians, these needs appear to break down into two categories:
- Storing the unstructured data of the original sources themselves, whether they be scans of archival documents or photographs of battle paintings or fortifications.
- Keeping tabs on the structured data extracted from that unstructured data for a particular analytical purpose – most often summaries, keywords or quantities, or maybe a quote or two.
These two types of information are fundamentally different yet related, as many methodological treatises will no doubt explain. They also require different types of note-taking capabilities. Historians are generally generalists and a surprising number still rely on 3″x5″ notecards and simple Word documents, but those on the cutting edge (like me?) also want all the cool toys our colleagues in other fields play with. Back in the day, I referred to the options as a quadrangle or rectangle.
On the one hand, historians want the unstructured, original documents, in all their messy glory. That means, ideally, transcripts of the full document, as well images of the originals. But the hard work for historians is to find structure in this chaos. We need the originals in case we need to verify a quote or ask a new question. But usually we are creating structured data by extracting information from the original sources according to a pre-determined method – ideally the methodological choices are made clear to the reader. Structuring the unstructured requires, at a minimum, keywords and categories, with a link back to the original whenever possible. There’s an immense amount of winnowing in the journey from source to analysis.
Unfortunately, few off-the-shelf software packages handle both unstructured and structured information with equal facility, a limitation I am once again confronting in my transition from MS Access to Devonthink. The CLIO or Kleio software by Manfred Thaller was an early, extremely historio-centric, attempt, but most historians fall back on packages with a broader user base. In part the specialization of software is probably explained by cultural factors as much as purely technical ones. Over the several decades of the Personal Computing Age, increasing processor power has expanded our toolkit far beyond the simple, business-friendly relational databases and punch-card legacy systems of the early days. But it took time – well into the 1990s social scientists still shoehorned their qualitative data into the quantitative model preferred by most early software, creating quantitative dummy variables (0=No, 1=Yes) and numeric codes (1=French, 2=English, 3=Dutch) that could be handled by the slower processors of the period. Though the 0/1-Yes/No dummy variable retains its elegant simplicity, today’s more powerful relational databases can handle big chunks of rich text, and we also have many other types of analysis to choose from: quantitative analysis software like SPSS and Minitab, GIS software which allows us to analyze geospatial relationships between objects, and textual analysis software for those seeking word frequencies, KWIC, co-occurrences, topic modeling and the like. Modern programs even allow photographic (and increasingly video) analysis, Picasa’s facial recognition being one of the simplest examples. Today, “big data” is as likely to be composed of text or images as it is of numbers.
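For readers who never suffered through that era, the old coding convention is easy to sketch. This is a hypothetical illustration of the 1990s-style encoding described above, not any particular package’s scheme; the value mappings are the ones from the text.

```python
# Encode qualitative attributes as the numeric dummy variables
# and codes that early statistical software expected.
nationality_codes = {"French": 1, "English": 2, "Dutch": 3}

def encode_record(nationality, discusses_deception):
    """Reduce a qualitative observation to two numbers."""
    return {
        "nationality": nationality_codes[nationality],
        "deception": 1 if discusses_deception else 0,  # 0=No, 1=Yes
    }

print(encode_record("French", True))
```

Every query then had to be phrased against those opaque numbers, which is exactly the shoehorning the paragraph describes.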
This embarrassment of computational and data riches (the data-mining metaphor isn’t accidental) comes at the price of having to deal with often-incompatible software packages, not to mention distinctive methodologies. Unfortunately I don’t have a solution for methodologically eclectic historians, but I figured I could at least give those ignorant of relational databases a better sense of what this type of software does well. So here I post my PowerPoint slides from a presentation I gave 14 years ago, describing the basic design of my note-taking Access database. It preserved both the unstructured original sources (transcriptions only, alas) and built a layer of structured data on top of it. As it turns out, I didn’t use it to its full potential, but perhaps it might be of use to others, if they can suss out what the slides mean without my accompanying explanation (that’ll cost you extra). Maybe I’ll even return to relational databasing in my next project.
In any note-taking system, having lots of categories (or “fields” in database parlance) in which to record information about sources (aka meta-data) is important. Really important. For historians, simple tags aren’t enough, nor is relying on the full-text alone, nor is a single hierarchical organizational scheme. What on earth do I mean? Keep reading.
Let’s say you’re looking at the question of how early moderns perceived military deception, trickery, subterfuge, lying, and the like. You find a bunch of germane sources and take notes on their discussions of the subject. To simplify matters, we’ll stay theoretical, focusing on how contemporaries expressed their preferences for either deception or straightforwardness in war, rather than using other measures (the frequency of their actual reliance on such sly stratagems, etc.). Perhaps you try to get a good balance of sources, including theoretical discussions of the use of stratagems as well as reactions to specific examples of deception on campaign.
So you’ve found dozens or hundreds of quotes from contemporaries – verily, the digital age has given early modern historians (military ones at least) a bounty of evidence to sift through! You have plenty of notes on other topics as well, so you need to identify those specific to deception for future reference. You could take the shortest route and simply tag them all with a “deception” tag. Assuming you want to examine whether contemporaries considered deception laudable or execrable, you could make your life easier and further sort your quotes into two bins based on the stance presented in each source: those quotes that support the use of deception (pro-deception) and those that are against it (anti-deception). Or maybe there are three possible values for this variable: pro, con, and ambivalent/neutral. Heck, maybe add in a separate value for those sources that you checked which you’d think should discuss the utility of deception but actually don’t – perhaps that’s noteworthy. So we have one Stance category with four possible values: pro, con, neutral/ambivalent, or not discussed. Call the Stance variable a group, a tag, a field, a category, or what have you. Note-takers don’t appear to have a standardized vocabulary for such things.
Thus tagged, you can now easily find those hundreds of examples in just about any note-taking system worth its salt. But historians, at least systematic ones, want to look for patterns within that data. A whole range of questions start to bubble up. Maybe you want to see if particular types of people shared the same view on the subject? Or whether geography or chronology (or both) played a role in shaping opinions on the morality of deception? Maybe you wonder whether these contemporary prescriptions and proscriptions were universal, or did contemporary judgments depend on the situation? Maybe you speculate that there are finer gradations within the pro-deception camp? Answering all of these questions requires slicing the data in a number of different ways. To do this, you need additional categories to differentiate your notes on deception.
- Let’s say you’re looking at the question primarily from the perspective of two groups, say, the English and the French (see how we’re staying totally hypothetical here?). So maybe you want to see if the English were more likely than the French to express their opposition to the use of trickery. (If you wanted to count up the quotes, you could even get all statistical, in a cross tab-y sort of way: nominal variable x nominal variable.) So we should add another variable, call it Side, to the Stance measure. That’s two variables to keep track of for each quote.
- But “Side” isn’t really specific enough. What we probably mean by “Side” is actually the side of the author of the quote. This assumes, of course, that the author’s point of view is represented by the quote, i.e. the quote isn’t part of a dialogue where one or both characters might not even represent the author’s position, or the author isn’t playing devil’s advocate in the text, or presenting the opposition’s case before rebutting it… Add a SideOfAuthor category.
- And if we have a SideOfAuthor field, we probably also need to note when a French author is talking about the French, versus when he might be talking about the English (maybe he’s even talking about a specific Englishman). Perhaps authors saw a difference between the two worth noting? Make it three variables (Stance, SideOfAuthor, SideOfSubject) for our note-taking system, or four for greater precision (Stance, SideOfAuthor, SideOfSubject, SubjectPerson).
- But maybe you perceive that the English seem more likely to talk about deception when the French are doing the deceiving – now we’re combining the previous categories. Cynical by nature, you wonder if the French do the same – seeing the deceptive dustmote in the English eye while ignoring the lying log in their own. Is it generally true that each side tends to downplay its own deceptive qualities and highlight those of its enemies, or is there a shared preference for (or against) deception in war? You could figure it out by looking for all the cases where a French author discusses a French subject, and where an English author discusses an English subject. To be certain, you’d need to examine all four cells of what I like to call “the box.”
To quickly sort all those hundreds of quotes into the four cells, you need those categories. In case you want to look at more than two sides – throw in the Dutch for good measure – you’ll convert your SideOfAuthor variable from a binary one (possibilities being French or English) to one allowing three or more options. If you want to make it a bit easier, you could add another binary variable, “Self”, which includes possible values of “self”, i.e. the author’s own side, and “other,” i.e. not the author’s side. That way, if you want to query whether authors were self-serving or not, you can simply use the Self variable, rather than test all the pairs of SideOfSubject and SideOfAuthor (SoS=French and SoA=French, SoS=English and SoA=English, SoS=Dutch and SoA=Dutch…).
But just because you’ve aggregated upward, don’t throw out the more specific SideOfSubject variable. Keep it so you can, for example, group together all the non-English authors’ views of the English if you were looking at more than two countries – to find out if everybody else is ganging up on the poor ol’ English (SoS=English and SoA=not English). Or maybe you’re interested to see if one nationality is more likely to talk about themselves (or the Other) than the Other.
- We might want to add yet another variable, what I vestigially refer to as EventID. This could be composed of a generic-specific variable pair: a field for the specific combat (the battle of Ramillies, the siege of Lille, the surprisal of Ghent…) and a related field for the combat type, a generic variable derived from the specific combat (deception in a surprise attempt, deception in a battle, deception in a siege…). Is deception more acceptable if it leads to a battle than if it leads to a siege? Keeping paired general-specific variables gives you flexibility and scalability.
Other plausible variables could be derived from this Event category. Perhaps contemporaries were pure functionalists: they praised (actual) cases of deception – regardless of who deceived whom – when they succeeded, and excoriated liars only when they were caught out? Or maybe there was a gradual shift in how a specific example of deception was viewed – maybe the French surprisal of Ghent in 1708 was declared base treachery when the English first learned of it, but within a few years the English had gained perspective and come to appreciate the cunning French trick? Perhaps you could examine those curious cases where you know deception was used at a specific event yet its use was not mentioned in a particular source (Stance=not discussed)…
So now we have a good setup, needing at least five different variables to answer our constellation of questions surrounding early modern views on military deception: Stance, SideOfAuthor, SideOfSubject, Self, and Event. Each quote needs the answers to these five questions. And that doesn’t include all the other variables that are associated with these five categories. To give just one example, each author has various attributes that might be germane in addition to their Side: maybe opinions on the utility of deception vary by one’s generational status, by an author’s military experience, by the theaters he fought in or the other authors that author read… Lots of metadata to keep track of.
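A minimal sketch may make the setup concrete. The field names are the ones discussed above; the sample quotes and values are invented for illustration, and deriving Self from the two Side fields (rather than storing it redundantly) is one possible design choice, not the only one.

```python
from collections import Counter

# One record per quote, carrying the variables discussed above.
notes = [
    {"quote": "Tutchin says the French are morons",
     "stance": "con", "side_of_author": "English",
     "side_of_subject": "French", "event": "surprisal of Ghent"},
    {"quote": "A French memoirist praises the same trick",
     "stance": "pro", "side_of_author": "French",
     "side_of_subject": "French", "event": "surprisal of Ghent"},
]

def self_or_other(note):
    """Derive the aggregate Self variable from the two Side fields."""
    return "self" if note["side_of_author"] == note["side_of_subject"] else "other"

# Cross-tab Stance by Self: the four cells of 'the box'.
box = Counter((note["stance"], self_or_other(note)) for note in notes)
print(box)
```

With the records structured this way, any of the bubbling-up questions above becomes a simple filter or count rather than a rereading of hundreds of quotes.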
Is this note-taking scenario over the top? Not at all. Not if historians want to base their arguments on more than a random or superficial glance at the sources. Not if historians want to avoid magically finding only evidence that confirms their preconceptions. And not if historians want to make use of the range of sources now easily accessible. In fact, the above example is even simpler than the reality. We could start by simply adding further possible values to existing variables. For the Self category, it would probably be important to appreciate that an English author could be talking about an Englishman (maybe need to break that up further into Tory vs. Whig, or Court-Country, as well as which camp the author belongs to), or this English author could be referring to an enemy (who could be the hated French, or maybe the despised-yet-want-to-be-trading-partners-with-them-all-the-same Spanish, or the problematic Protestant Hungarians rebelling against England’s Catholic Austrian ally), or perhaps the English author actually was referring to an ally (Prince Eugene), or perhaps even discussing a neutral (Charles XII of Sweden in his concurrent Great Northern War, or maybe the less-sympathetic Terrible Turks), or maybe it’s a historical reference to Caesar’s use of deception. So at the least, possible Self values could include: own side, enemy, ally, neutral, historical. I can certainly imagine a scenario where an author would opportunistically treat his friends and allies more gently than his enemies, but his friends more kindly even than his allies (or certain kinds of allies, maybe the Protestant ones…). Other structural additions might be needed: some of the variables might have records that require multiple values – maybe it’s noteworthy to see if, whenever English authors discuss their own deceptive practices, they always introduce French subterfuge to muddy the issue?
In this case SideOfSubject might need both French and English values for the same quote, or at least a “more than one” value. Does your note-taking system allow this? I hope so. You could plausibly keep throwing in additional variables – the list goes on and on.
Too detailed? Not really. Too specialized to be useful for other topics? Hardly. These categories are not unique to the question of military deception. These categories are, in fact, inherent in just about every possible topic somebody might be discussing: Author A, member of group B (SideOfAuthor), said C (Stance) about person/group D’s behavior (SideOfSubject) regarding topic E (in this case, use of military deception). Whenever more than one type of people (or more than one person) says something about somebody else (or themselves), either in general or relating to a specific instance, and has an opinion on it, you should be tracking these details. You’ll never know if your sources are being self-serving or not without this information. You’ll never know how widespread a particular opinion was without this information. You’ll never know which possible patterns help explain the phenomenon under study without this information. In short, you’ll never really know.
How does this intersect with note-taking? You can mentally assign values to each of these variables every time you read the quote, but note-taking is about summarizing the quote in numerous ways so you don’t need to reprocess it every time. You need categories, ways to organize any single quote, or any list of quotes, by any (or all) of these variables. So you might as well spend a few minutes at the start (after you’ve read some sources) figuring out which variables are worth tracking and what their possible values might be (keep room for future changes), and then track them for each record. That’s the note-taker’s way.
Next post: how this all relates to Devonthink. I think.
Historians owe a debt of gratitude to those turn-of-the-century archivists, whose nationalistic yearnings led to the creation of dozens of volumes of archive inventories and catalogs. If you do much work on French military history, you likely know the Inventaire sommaire des archives historiques: archives de guerre. This multi-volume inventory provides short summaries of each volume in the A1 correspondence series, more than 3500 volumes up to the year 1722. That’s a lot to keep track of, as I estimated a while back. So much, in fact, that you’ll likely be going back to that particular well again and again. If so, it might be worth your while to include those details in your note-taking system. Here’s how I did it in DTPO.
The first step in any digitization process is to scan the printed pages and convert the page images into text. In ye olden days you had to do the scanning yourself, and then run it through OCR software. Nowadays it’s more likely that you can download an .epub version from Google Books and convert it to .txt with the free Calibre software. Worst case, download the PDF version and OCR it yourself.
Now you find yourself with a text document full of historical minutiae. Import the text (and the PDF, just to be safe) into DTPO. Next, add some delimiters which will indicate where to separate the one big file into many little files. But do it smart, with automation. Open the text document in Word (right-click in DTPO or find it in the Finder), and then start your mass find-replace iterations to add delimiters, assuming there’s a pattern you can use to mark the break between each volume. Maybe each volume is separated by two paragraph marks in a row, in which case you would add a delimiter like ##### after each ^p^p. You’ll end up with something like this:
As you can see, the results are a bit on the dirty side – I’ll see if I can get a student worker to clean up the volume numbers since they’re kinda important, but the main text is good enough to yield search results.
Once you’ve saved the doc in Word and returned to DTPO, you can use the DT forum’s Explode with Delimiter script. Check the resulting hundreds of records – if there are more than a few errors, erase the newly-created documents, fix the problematic delimiters in the original, and reparse. You’ll want to search not only for false positives, i.e. delimiters added where they shouldn’t have been, but also for false negatives, volumes that should have delimiters but were missed. For example, search ^p^# in Word to check for any new paragraphs starting with a number (assuming the inventory starts each paragraph with the volume number).
But wait, there’s more. Once you’ve parsed those, you can even take it a step further. The summaries aren’t usually at the document level, but there is enough detail that it’s worth parsing the descriptions within each volume. After converting all the parsed txt files to rtf files, move each volume’s document to the appropriate provenance tag/group, and then run another parse on that volume’s record, with the ; as delimiter. In the case above, you might want to also parse by the French open quotation mark, or find-replace the « with ;«. Parsing this volume summary gives you a separate record for each topic within the volume, or at least most of them. With all these new parsed records still selected, convert to rtf and add the provenance info to the Spotlight Comments. Now you’re ready to assign each parsed topic document to whichever topical groups you want.
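The two-pass parse above (find-replace to add delimiters, then explode; then explode again on semicolons) can be approximated outside Word too. Here is a rough Python sketch of the same idea; the inventory snippet is invented, and the ##### delimiter and semicolon split are the ones described above.

```python
import re

# A made-up fragment of inventory text: volumes separated by blank
# lines, topics within a volume separated by semicolons.
raw = ("1234. Lettres du ministre; sièges de Lille; « divers »\n\n"
       "1235. Correspondance; campagne de Flandre\n\n"
       "1236. Mémoires divers")

# Pass 1: the ^p^p -> ^p^p##### replacement, then explode on the
# delimiter to get one chunk per volume.
volumes = re.sub(r"\n\n", "\n\n#####", raw).split("#####")

# Pass 2: within a volume, split the summary on semicolons to get
# one record per topic.
topics = [t.strip() for t in volumes[0].split(";")]
print(len(volumes), topics)
```

The real Explode with Delimiter script creates DT records rather than Python strings, of course, but the splitting logic is the same.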
It’s not perfect, but it’s pretty darn good considering how little effort it requires; maybe an hour or so gets you 500+ volume inventories in separate records. Now you’ve got all those proper nouns in short little documents, ready to search, (auto-)group and sort.
Yes, I know I spend way too much time thinking about note-taking. What of it?
While reading some online discussions of software-based textual analysis, I came across a link to an excellent article summarizing the weaknesses of full-text search: Jeffrey Beall, “The Weaknesses of Full-Text Searching,” The Journal of Academic Librarianship 34, no. 5 (September 2008): 438-444. Abstract:
This paper provides a theoretical critique of the deficiencies of full-text searching in academic library databases. Because full-text searching relies on matching words in a search query with words in online resources, it is an inefficient method of finding information in a database. This matching fails to retrieve synonyms, and it also retrieves unwanted homonyms. Numerous other problems also make full-text searching an ineffective information retrieval tool. Academic libraries purchase and subscribe to numerous proprietary databases, many of which rely on full-text searching for access and discovery. An understanding of the weaknesses of full-text searching is needed to evaluate the search and discovery capabilities of academic library databases.
If you ever need to explain to your students why keywords and subject headings and indices (indexes) are useful tools, this article is a good place to start.
Full-text search is certainly better than nothing – particularly if you can use fuzzy searching, wildcards, and proximity – but I sometimes wonder if a keyword-only database (a digital index) would still be more helpful than a full-text database, everything else being equal.
Repeat after me: full-text searching must be combined with meta-data in order to search subsets and sort results.
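To make the mantra concrete, here is a toy sketch of a search that narrows the candidate set by metadata before matching against full text. The records, tags, and query are all invented for illustration.

```python
# Toy records carrying both full text and metadata.
records = [
    {"title": "Tutchin on French trickery", "tags": {"deception", "English"},
     "text": "The French relied on base stratagems at Ghent."},
    {"title": "Siege logistics memo", "tags": {"logistics"},
     "text": "Bread and forage for the stratagems of supply."},
]

def search(records, tag=None, query=""):
    """Filter by metadata first, then match the query against full
    text, returning summary titles for quick triage of the hits."""
    hits = [r for r in records if tag is None or tag in r["tags"]]
    return [r["title"] for r in hits if query.lower() in r["text"].lower()]

# Full text alone returns a false hit on 'stratagems';
# adding the metadata filter discards it.
print(search(records, query="stratagems"))
print(search(records, tag="deception", query="stratagems"))
```

Note that the summary titles returned here are what lets you triage the remaining hits at a glance, as argued in the notecard post above.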