
Voyant-to-web also a success

In case you need proof, here’s a link (collocate) graph from Voyant tools, based off the text from the second volume of the English translation of the “French” Duke of Berwick’s memoirs published in 1779: http://jostwald.com/Voyant/VoyantLinks-Berwick1.html. Curious which words Berwick used most frequently, and which other words they tended to be used with/near? (Or his translator, in any case.) Click the link above and hopefully you’ll see something like this, but interactive:

Screenshot 2017-06-25 14.49.23.png

After you upload your text corpus in the web version of Voyant, you can then export any of the tools and embed it in your own website using an iframe (inline frame). Note that you can also click on any of the terms in the embedded web version and it will open up the full web version of Voyant, with the corpus pre-loaded. Something like this, but oh-so-much-more-interactive:

Screenshot 2017-06-25 14.47.08.png

Apparently the Voyant server keeps a copy of the text you upload – no idea how long the Voyant servers keep the text, but I guess we’ll find out. There’s also a VoyantServer option, which you install on your own computer, for faster processing and greater privacy.

Never heard of Voyant? Then you’d best get yourself some early modern sources in full text format and head on over to http://voyant-tools.org.


Automating Newspaper Dates, Old Style (to New Style)

If you’ve been skulking over the years, you know I have a soft spot for Devonthink, a receptacle into which I throw all my files (text, image, PDF…) related to research and teaching. I’ve been modifying my DTPO workflow a bit over the past week, which I’ll discuss in the future.

But right now, I’ll provide a little glimpse into my workflow for processing the metadata of the 20,000 newspaper issues (yes, literally 20,000 files) that I’ve downloaded from various online collections over the years: Google Books, but especially Gale’s 17C-18C Burney and Nicholls newspaper collections. I downloaded all those files the old-fashioned way (rather than scraping them), but just because you have all those PDFs in your DTPO database, that still doesn’t mean that they’re necessarily in the easiest format to use. And maybe you made a minor error, but one that is multiplied by the 20,000 times you made that one little error. So buckle up as I describe the process of converting text strings into dates and then back, with AppleScript. Consider it a case study of problem-solving through algorithms.

The Problem(s)

I have several problems I need to fix at this point, generally falling under the category of “cleaning” (as they say in the biz) the date metadata. Going forward, most of the following modifications won’t be necessary.

First, going back several years I stupidly saved each newspaper issue by recording the first date for each issue. No idea why I didn’t realize that the paper came out on the last of those dates, but it is what it is.

Screen Shot 2014-03-09 at 7.53.14 PM

London Gazette: published on Dec. 13 or Dec. 17?

Secondly, those English newspapers are in the Old Style calendar, which the English stubbornly clung to till mid-century. But since most of those newspapers were reporting on events that occurred on the Continent, where they used New Style dates, some dates need manipulating.

Automation to the Rescue!

Because I’m not going to re-date 20,000 newspaper issues manually, I’ve enlisted my programmer-wife (TM) to help me automate the process. She doesn’t know the syntax of AppleScript very well, but since she programs in several other languages, and because most programming languages use the same basic principles, and because there’s this Internet thing, she was able to make some scripts that automate most of what I need. So what do I need?

First, for most of the newspapers I need to add several days to the listed date, to reflect the actual date of publication – in other words, to convert the first date listed in the London Gazette example above (Dec. 13) into the second date (Dec. 17). So I need to take the existing date, listed as text in the format 1702.01.02, convert it from a text string into an actual date, and then add several days to it, in order to convert it to the actual date of publication. How many days exactly?

Well, that’s the thing about History – it’s messy. Most of these newspapers tended to be published on a regular schedule, but not too regular. So you often had triweekly publications (published three times per week), that might be published in Tuesday-Thursday, Thursday-Saturday, and Saturday-Tuesday editions. But if you do the math, that means the Saturday-Tuesday issue covers a four-day range, whereas the other two issues per week only cover a three-day range. Since this is all about approximation and first-pass cleaning, I’ll just assume all the issues are three-day ranges, since those should be two-thirds of the total number of issues. For the rest, I have derivative code that will tweak those dates as needed, e.g. add one more to the resulting date if it’s a Saturday-Tuesday issue, instead of a T-R or R-S issue. If I was really fancy, I’d try to figure out how to convert it to weekday and tell the code to treat any Tuesday publication date as a four-day range (assuming it knows dates before 1900, which has been an issue with computers in the past – Y2k anyone?).
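For the weekday idea, one way around the pre-1900 worry is to skip the computer's built-in date type and compute the weekday arithmetically. The sketch below is in Python rather than AppleScript and is only an illustration under stated assumptions: it uses the standard Julian Day Number formula for Julian-calendar (Old Style) dates, the names `julian_weekday` and `days_until_publication` are hypothetical, and it assumes the filename carries the first date of the issue, so that a Saturday start marks a Saturday-Tuesday issue.

```python
def julian_weekday(year: int, month: int, day: int) -> int:
    """Weekday of a Julian-calendar (Old Style) date: 0 is Monday, 6 is Sunday."""
    a = (14 - month) // 12   # 1 for January and February, otherwise 0
    y = year + 4800 - a
    m = month + 12 * a - 3   # March becomes month 0, February month 11
    # Julian Day Number for a date in the Julian calendar
    jdn = day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083
    return jdn % 7


def days_until_publication(year: int, month: int, day: int) -> int:
    """Days to add to an issue's first (OS) date to reach its publication date.

    Assumption: a triweekly paper whose Saturday-Tuesday issue spans four
    days, while the other two issues of the week span three.
    """
    saturday = 5
    return 3 if julian_weekday(year, month, day) == saturday else 2
```

A handy sanity check: William III died on Sunday, 8 March 1702 OS, and `julian_weekday(1702, 3, 8)` does return 6.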

So the basic task is to take a filename of ‘1702.01.02 Flying Post.pdf’, convert the first part of the string as text (the ‘1702.01.02’) into a date by defining the first four characters as a year, the 6th & 7th characters as a month…, then add 2 days to the resulting date, and then rename the file with this new date, converted back from date into a string with the format YYYY.MM.DD. Because I was consistent in that part of my naming convention, the first ten characters will always be the date, and the periods can be used as delimiters if needed. Easy-peasy!
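The same string-to-date-and-back round trip can be sketched in a few lines of Python. This is a minimal illustration, not the AppleScript used in DTPO: the function name `shift_filename_date` is hypothetical, and it assumes every filename starts with a ten-character YYYY.MM.DD stamp.

```python
from datetime import date, timedelta


def shift_filename_date(filename: str, days: int = 2) -> str:
    """Move the leading YYYY.MM.DD stamp of a filename forward by `days`."""
    stamp, rest = filename[:10], filename[10:]
    # The periods act as delimiters between year, month and day
    year, month, day = (int(part) for part in stamp.split("."))
    shifted = date(year, month, day) + timedelta(days=days)
    # Format by hand rather than with strftime, whose handling of
    # early years can vary by platform
    new_stamp = f"{shifted.year:04d}.{shifted.month:02d}.{shifted.day:02d}"
    return new_stamp + rest


print(shift_filename_date("1702.01.02 Flying Post.pdf"))
```

Python's `date` type covers years 1 through 9999, so dates before 1900 are not a problem here, and month and year rollovers are handled by the date arithmetic.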

But that’s not all. I also need to then convert that date of publication to New Style by adding 11 days to it (assuming the dates are 1700 or later – before 1700 the OS calendar was 10 days behind the NS calendar). But I want to keep the original OS publication date as well, for citation purposes. So I replace the old OS date on the front of the filename with the new NS date, and append the original date to the end of the filename with an ‘OS’ after it for good measure (and delete the .pdf), and Bob’s your uncle. In testing, it works when you shift from one month to another (e.g. January 27 converts to February 7), and even from year to year. I won’t worry about the occasional leap year (1704, 1708, 1712). Nor will I worry about how some newspapers used Lady Day (March 25) as their year-end, meaning that they went from December 30, 1708 to January 2, 1708, and only caught up to 1709 in late March. Nor does it help that their issue numbers are often wrong.
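A rough Python sketch of the Old Style to New Style step and the final naming scheme follows. The names `os_to_ns`, `stamp` and `redated_name` are hypothetical. One assumption differs slightly from the "1700 or later" rule of thumb: the gap is switched from 10 to 11 days at 1 March 1700 OS, since it widened at the Julian leap day of 29 February 1700. The sketch only makes sense for dates between 1582 and England's calendar change in 1752.

```python
from datetime import date, timedelta


def os_to_ns(os_date: date) -> date:
    """Convert an Old Style (Julian) date to New Style (Gregorian).

    The OS date is stored in a datetime.date as if its fields were Julian.
    Assumption: 10 days' difference through 28 Feb 1700 OS, 11 days after.
    (29 Feb 1700 OS cannot be represented, because Python's calendar is
    Gregorian and 1700 is not a Gregorian leap year.)
    """
    gap = 11 if os_date >= date(1700, 3, 1) else 10
    return os_date + timedelta(days=gap)


def stamp(d: date) -> str:
    """Format a date as YYYY.MM.DD."""
    return f"{d.year:04d}.{d.month:02d}.{d.day:02d}"


def redated_name(os_publication: date, title: str) -> str:
    """NS date first (for sorting), OS date last (for citation)."""
    return f"{stamp(os_to_ns(os_publication))} {title} {stamp(os_publication)} OS"


print(redated_name(date(1702, 1, 4), "Flying Post"))
```

Putting the NS date first keeps chronological sorting by name intact, while the trailing OS date preserves what is printed on the issue itself.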

I’m too lazy to figure out how to make the following AppleScript code format like code in WordPress, but the basics look like this:
-- Convert English newspaper Title from OSStartDate to NSEndDate & EndDate OS, +2 for weekday
-- Based very loosely off Add Prefix To Names, created by Christian Grunenberg Sat May 15 2004.
-- Modified by Liz and Jamel Ostwald May 26 2017.
-- Copyright (c) 2004-2014. All rights reserved.
-- Based on (c) 2001 Apple, Inc.

tell application id "DNtp"
    try
        set this_selection to the selection
        if this_selection is {} then error "Please select some contents."

        repeat with this_item in this_selection

            -- Split "1702.01.02 Flying Post.pdf" into the date stamp and the title
            set current_name to the name of this_item
            set mydate to texts 1 thru ((offset of " " in current_name) - 1) of current_name
            -- Keeps the leading space and drops the four-character ".pdf" extension
            set myname to texts 11 thru -5 of current_name

            -- Build a real date from the stamp. Set the day to 1 first so that
            -- running this on the 29th-31st cannot roll over into the next month.
            set newdate to the current date
            set the day of newdate to 1
            set the year of newdate to ((texts 1 thru 4 of mydate) as integer)
            set the month of newdate to ((texts 6 thru 7 of mydate) as integer)
            set the day of newdate to ((texts 9 thru 10 of mydate) as integer)

            -- enddate: OS publication date (+2 days); newdate: its NS equivalent (+2 and +11 days)
            set enddate to newdate + (2 * days)
            set newdate to newdate + (13 * days)
            tell (newdate)
                set daystamp to day
                set monthstamp to (its month as integer)
                set yearstamp to year
            end tell

            -- Zero-pad day and month to two digits
            set daystamp to (texts -2 thru -1 of ("0" & daystamp as text))
            set monthstamp to (texts -2 thru -1 of ("0" & monthstamp as text))

            set formatdate to (yearstamp as text) & "." & monthstamp & "." & daystamp

            tell (enddate)
                set daystamp2 to day
                set monthstamp2 to (its month as integer)
                set yearstamp2 to year
            end tell

            set daystamp2 to (texts -2 thru -1 of ("0" & daystamp2 as text))
            set monthstamp2 to (texts -2 thru -1 of ("0" & monthstamp2 as text))

            set formatenddate to (yearstamp2 as text) & "." & monthstamp2 & "." & daystamp2

            -- NS date first, then the title, then the OS publication date
            set new_item_name to formatdate & myname & " " & formatenddate & " OS"
            set the name of this_item to new_item_name

        end repeat
    on error error_message number error_number
        if the error_number is not -128 then display alert "DEVONthink Pro" message error_message as warning
    end try
end tell

So once I do all those things, I can use a smart group and sort the Spotlight Comment column chronologically to get an accurate sense of the chronological order in which publications discussed events.

This screenshot shows the difference – some of the English newspapers haven’t been converted yet (I’m doing it paper by paper because the papers were often published on different schedules), but here you can see how OS and NS dates were mixed in willy-nilly, say comparing the fixed Flying Boy and Evening Post with the yet-to-be-fixed London Gazette and Daily Courant issues.

DTPO Newspapers redated.png

Of course the reality has to be even more complicated (Because It’s History!), since an English newspaper published on January 1, 1702 OS will publish items from continental newspapers, dating those articles in NS – e.g., a 1702.01.01 OS English newspaper will have an article dated 1702.01.05 NS from a Dutch paper. So when I take notes on a newspaper issue, I’ll have to change the leading NS date of the new note to the date on the article byline, so it will sort chronologically where it belongs. But still.

There’s gotta be a better way

In preparation for a new introductory digital history course that I’ll be teaching in the fall, I’ve been trying to think about how to share my decades of accumulated computer wisdom with my students (says the wise sage, stroking his long white beard). Since my personal experience with computers goes back to the 80s – actually, the late 70s with Oregon Trail on dial-up in the school library – I’m more of a Web 1.0 guy. Other than blogs, I pretty much ignore social media like Facebook and Twitter (not to mention Snapchat, Instagram, Pinterest…), and try to do most of my computer work on a screen larger than 4″. So I guess that makes me a kind of cyber-troglodyte in 2017. But I think that does allow me a much broader perspective of what computers can and can’t do. One thing I have learned to appreciate, for example, is how many incremental workflow improvements are readily available, shortcuts that don’t require writing Python from the command line.

As a result, I’ll probably start the course with an overview of the variety of ways computers can help us complete our tasks more quickly and easily, which requires understanding the variety of ways in which we can achieve these efficiencies. After a few minutes of thought (and approval from my “full-stack” computer-programming wife), I came up with this spectrum that suggests the ways in which we can make computers do more of our work for us. Toil, silicon slave, toil!

Computer automation spectrum.png

Automation Spectrum: It’s Only a Model

Undoubtedly others have already expressed this basic idea, but most of the digital humanities/digital history I’ve seen online is much more focused on the extreme right of this spectrum (e.g. the quite useful but slightly intimidating Programming Historian) – this makes sense if you’re trying to distantly read big data across thousands of documents. But I’m not interested in the debate whether ‘real’ digital humanists need to program or not, and in any case I’m focused on undergraduate History majors who often have limited computer skills (mobile apps are just too easy). Therefore I’m happy if I can remind students that there is a large variety of powerful automation features available to people with just a little bit of computer smarts and an Internet connection, things that don’t require learning to speak JavaScript or Python fluently. Call it kaizen if you want. The middle of the automation spectrum, in other words.

So I’ll want my students, for example, to think about low-hanging fruit (efficiency fruit?) that they can spend five minutes googling and save themselves hours of mindless labor. As an example, I’m embarrassed to admit that it was only when sketching this spectrum that I realized that I should try to automate one of the most annoying features of my current note-taking system, the need to clean up hundreds of PDFs downloaded from various databases: Google Books, Gale’s newspaper and book databases, etc. If you spend any time downloading early modern primary sources (or scan secondary sources), you know that the standard file format continues to be Adobe Acrobat PDFs. And if you’ve seen the quality of early modern OCR’d text, you know why having the original page images is a good idea.

But you may want, for example, to delete pages from PDFs that include various copyright text – that text will confuse DTPO’s AI and your searches. I’m sure there are more sophisticated ways of doing that, but the spectrum above should prompt you to wonder whether Adobe Acrobat has some kind of script or macro feature that might speed up deleting such pages from 1,000s (literally) of PDF documents that you’ve downloaded over the years. And, lo and behold, Adobe Acrobat does indeed have an automation feature that allows you to carry out the same PDF manipulation again and again. Once you realize “there’s gotta be a better way!”, you only need to figure out what that feature is called in the application in question. For Adobe Acrobat it used to be called batch processing, but in Adobe Acrobat Pro DC such mass manipulations now fall under the Actions moniker. So google ‘Adobe Acrobat Actions’ and you’ll quickly find websites that allow you to download various actions people have created. Which allows you to quickly learn how the feature works, and to modify existing actions. For example, I made this Acrobat Action to add “ps” (primary source) to the Keywords metadata field of every PDF file in the designated folder:

Screenshot 2017-05-10 18.52.17.png

I already copied and tweaked macros and AppleScripts that will add Keywords to rich text files in my Devonthink database, but this Adobe solution is ideal after I’ve downloaded hundreds of PDFs from, say, a newspaper database.

Similarly, this next action will delete the last page of every PDF in the designated folder. (I just hardcoded to delete page 4, because I know newspaper X always has 4 pages – I can sort by file size to locate any outliers – and the last page is always the copyright page with the nasty text I want to delete. I can, for example, change the exact page number for each newspaper series, though there’s probably a way to make this a variable that the user can specify with each use):

Screenshot 2017-05-10 18.52.43.png

Computers usually have multiple ways to do any specific task. For us non-programmers, the internet is full of communities of nerds who explain how to automate all sorts of software tasks – forums (fora?) are truly a god-send. But it first requires us to expect more from our computers and our software. For any given software, RTFM (as they say), and then check out the software’s website forum – you’ll be amazed at the stuff you find. Hopefully all that time you save from automation won’t be spent obsessively reading the forum!

No wonder we historians are bad at math – they keep changing the answers

Apropos an old thread on naming wars based off their duration (and how complicated that really is), this story appeared recently on my History News Network feed. It’s neither early modern nor European, but it’s been a busy six months.

file-jan-13-9-59-00-pm

Story from the New York Times: https://www.nytimes.com/2017/01/11/world/asia/china-japan-textbooks-war.html?_r=0

Professor: “How long was the Eight-Year War of Resistance against Japanese Aggression?”
Student: “Eight years.”
Professor: “Wrong. Fourteen.”

My main thought: while it’s nice that there’s an official name for wars, just imagine the need to change all those references and Library of Congress subject headings. Ugh.

 

I sure do love Lincoln and Washington

Because they give us U.S. faculty on a MWF teaching schedule a full week off in the Spring, and that’s before Spring Break. Which, combined with the two consecutive snow days last Friday and this past Monday, means I’ve had the time to finish up my siege capitulation chapter (okay, 99% done) that I’ve been working on forever. Literally. I wrote a graduate seminar paper on the subject circa 1994.

Why has it taken so long to finish this chapter with a target length of only 12,000 words? Let me count the ways, leaving aside non-project issues: Read More…

GTD in Pocket Informant Usage Scenarios

So what’s the point, you might ask, of fiddling around with all these arcane beasts like Context and Action? And why should we care about the Parent Task-Child Task relationship? Read on for how my setup in PI is used.


Checklists for the Academic Grind

With all the preparatory discussion of Getting Things Done out of the way, here are a variety of work-related projects and tasks that I use in Pocket Informant, most of which can be reused regularly. These are largely glorified checklists, which, recent research has shown, are extremely useful. Even experts (like experienced doctors and pilots) benefit from them when time is short and focus is too easily distracted. Remember that the point of these GTD-themed task checklists is to do all of the thinking about the process once, when you make the list (though you can obviously revise it). So GTD checklists should be slightly different from “trigger” lists (lists of things that trigger you to remember other things), in that the GTD tasks should be explicit physical actions. This way you don’t have to reconstruct the process every time, usually right before the thing is due. You don’t have to, for example, spend the mental energy (as small as it might be – it adds up) to remember what exactly you’re supposed to do with the ‘receipts’ entry on that list. These lists are made even more useful with additional metadata à la GTD, and the ability to integrate them into your calendar. Read More…