Archive | May 2017

Automating Newspaper Dates, Old Style (to New Style)

If you’ve been skulking over the years, you know I have a soft spot for Devonthink, a receptacle into which I throw all my files (text, image, PDF…) related to research and teaching. I’ve been modifying my DTPO workflow a bit over the past week, which I’ll discuss in the future.

But right now, I’ll provide a little glimpse into my workflow for processing the metadata of the 20,000 newspaper issues (yes, literally 20,000 files) that I’ve downloaded from various online collections over the years: Google Books, but especially Gale’s 17C-18C Burney and Nicholls newspaper collections. I downloaded all those files the old-fashioned way (rather than scraping them), but just because you have all those PDFs in your DTPO database, that still doesn’t mean that they’re necessarily in the easiest format to use. And maybe you made a minor error, but one that is multiplied by the 20,000 times you made that one little error. So buckle up as I describe the process of converting text strings into dates and then back, with AppleScript. Consider it a case study of problem-solving through algorithms.

The Problem(s)

I have several problems I need to fix at this point, generally falling under the category of “cleaning” (as they say in the biz) the date metadata. Going forward, most of the following modifications won’t be necessary.

First, going back several years, I stupidly saved each newspaper issue under the first date of its date range. No idea why I didn’t realize that the paper came out on the last of those dates, but it is what it is.

[Image: Screen Shot 2014-03-09 at 7.53.14 PM]

London Gazette: published on Dec. 13 or Dec. 17?

Secondly, those English newspapers are in the Old Style calendar, which the English stubbornly clung to until 1752. But since most of those newspapers were reporting on events that occurred on the Continent, where they used New Style dates, some dates need manipulating.

Automation to the Rescue!

To automate this process (because I’m not going to re-date 20,000 newspaper issues manually), I’ve enlisted my programmer-wife (TM). She doesn’t know the syntax of AppleScript very well, but since she programs in several other languages, and because most programming languages use the same basic principles, and because there’s this Internet thing, she was able to make some scripts that automate most of what I need. So what do I need?

First, for most of the newspapers I need to add several days to the listed date, to reflect the actual date of publication – in other words, to convert the first date listed in the London Gazette example above (Dec. 13) into the second date (Dec. 17). So I need to take the existing date, listed as text in the format 1702.01.02, convert it from a text string into an actual date, and then add several days to it, in order to convert it to the actual date of publication. How many days exactly?

Well, that’s the thing about History – it’s messy. Most of these newspapers tended to be published on a regular schedule, but not too regular. So you often had triweekly publications (published three times per week) that might be published in Tuesday-Thursday, Thursday-Saturday, and Saturday-Tuesday editions. But if you do the math, that means the Saturday-Tuesday issue covers a four-day range, whereas the other two issues per week only cover a three-day range. Since this is all about approximation and first-pass cleaning, I’ll just assume all the issues are three-day ranges, since those should be two-thirds of the total number of issues. For the rest, I have derivative code that will tweak those dates as needed, e.g. add one more day to the resulting date if it’s a Saturday-Tuesday issue, instead of a T-R or R-S issue. If I were really fancy, I’d try to figure out how to convert it to weekday and tell the code to treat any Tuesday publication date as a four-day range (assuming it knows dates before 1900, which has been an issue with computers in the past – Y2k anyone?).
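A rough sketch of that weekday idea, in Python rather than AppleScript and purely as an illustration (the function names are made up, and nothing here has been run against the actual database). It assumes the filename carries the Old Style start date of the issue, and that the date falls between 1 March 1700 and February 1800, when OS ran 11 days behind NS. The one wrinkle is that Python's calendar is Gregorian all the way back, so the OS date has to be shifted to its NS equivalent before asking for the weekday.

```python
from datetime import date, timedelta

# Assumption: OS dates from 1 March 1700 to 28 February 1800 (11-day gap).
OS_TO_NS = timedelta(days=11)
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def os_weekday(os_date: date) -> str:
    """Weekday name of an Old Style date.

    Python's date type is proleptic Gregorian, so the OS date is shifted to
    its NS equivalent first; both labels name the same physical day.
    """
    return WEEKDAYS[(os_date + OS_TO_NS).weekday()]


def days_to_publication(os_start: date) -> int:
    """Days to add to an issue's first date to reach its publication date."""
    # A Saturday-Tuesday issue spans four days; the other two span three.
    return 3 if os_weekday(os_start) == "Saturday" else 2


start = date(1702, 1, 3)  # hypothetical issue dated from 3 January 1702 OS
print(os_weekday(start), days_to_publication(start))
```

Checking the start date for a Saturday (rather than the publication date for a Tuesday) is just a choice made for this sketch; either test identifies the same four-day issues.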

So the basic task is to take a filename of ‘1702.01.02 Flying Post.pdf’, convert the first part of the string as text (the ‘1702.01.02’) into a date by defining the first four characters as a year, the 6th & 7th characters as a month…, then add 2 days to the resulting date, and then rename the file with this new date, converted back from date into a string with the format YYYY.MM.DD. Because I was consistent in that part of my naming convention, the first ten characters will always be the date, and the periods can be used as delimiters if needed. Easy-peasy!
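The string-to-date-and-back round trip described above can be sketched in a few lines of Python. This is an illustration of the logic only, not the AppleScript that does the real renaming; `shift_filename_date` is a hypothetical name, and the sketch assumes every filename starts with a ten-character YYYY.MM.DD stamp.

```python
from datetime import date, timedelta


def shift_filename_date(filename: str, days: int = 2) -> str:
    """Return the filename with its leading YYYY.MM.DD date moved by `days`."""
    # The first ten characters are always the date; the rest is kept as-is.
    stamp, rest = filename[:10], filename[10:]
    year, month, day = (int(part) for part in stamp.split("."))
    shifted = date(year, month, day) + timedelta(days=days)
    # Zero-pad month and day so the names still sort chronologically.
    return f"{shifted.year:04d}.{shifted.month:02d}.{shifted.day:02d}{rest}"


print(shift_filename_date("1702.01.02 Flying Post.pdf"))
# 1702.01.04 Flying Post.pdf
```

Because the arithmetic happens on a real date rather than on the text, month and year boundaries take care of themselves.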

But that’s not all. I also need to then convert that date of publication to New Style by adding 11 days to it (assuming the dates are 1700 or later – before 1700 the OS calendar was 10 days behind the NS calendar). But I want to keep the original OS publication date as well, for citation purposes. So I replace the old OS date on the front of the filename with the new NS date, and append the original date to the end of the filename with an ‘OS’ after it for good measure (and delete the .pdf), and Bob’s your uncle. In testing, it works when you shift from one month to another (e.g. January 27 converts to February 7), and even from year to year. I won’t worry about the occasional leap year (1704, 1708, 1712). Nor will I worry about how some newspapers used Lady Day (March 25) as their year-end, meaning that they went from December 30, 1708 to January 2, 1708, and only caught up to 1709 in late March. Nor does it help that their issue numbers are often wrong.
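For anyone who wants to check the arithmetic outside DEVONthink, here is a minimal Python sketch of the same renaming logic. The function names are hypothetical, and it assumes dates before 1800. One refinement over the rule of thumb above: strictly speaking the gap grows from 10 to 11 days on 1 March 1700 OS rather than on 1 January, because 1700 was a leap year in the Old Style calendar only.

```python
from datetime import date, timedelta

# OS ran 10 days behind NS through February 1700 OS, then 11 days (until 1800).
ELEVEN_DAY_GAP_STARTS = date(1700, 3, 1)


def os_to_ns(os_date: date) -> date:
    """Convert an Old Style date to New Style (valid for dates before 1800)."""
    gap = 11 if os_date >= ELEVEN_DAY_GAP_STARTS else 10
    return os_date + timedelta(days=gap)


def stamp(d: date) -> str:
    """Format a date as YYYY.MM.DD."""
    return f"{d.year:04d}.{d.month:02d}.{d.day:02d}"


def redate(filename: str, lead_days: int = 2) -> str:
    """NS publication date + title + OS publication date + ' OS'."""
    year, month, day = (int(part) for part in filename[:10].split("."))
    title = filename[10:]  # keeps the leading space before the title
    if title.endswith(".pdf"):
        title = title[:-4]
    os_published = date(year, month, day) + timedelta(days=lead_days)
    return f"{stamp(os_to_ns(os_published))}{title} {stamp(os_published)} OS"


print(redate("1702.01.25 Flying Post.pdf"))
# 1702.02.07 Flying Post 1702.01.27 OS
```

The leap years 1704, 1708 and 1712 need no special handling, since they are leap years in both calendars. The only awkward date in this range is 29 February 1700 OS, which Python's `date` cannot represent at all.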

I’m too lazy to figure out how to make the following AppleScript code format like code in WordPress, but the basics look like this:
-- Convert English newspaper Title from OSStartDate to NSEndDate & StartDate OS, +2 for weekday
-- Based very loosely off Add Prefix To Names, created by Christian Grunenberg Sat May 15 2004.
-- Modified by Liz and Jamel Ostwald May 26 2017.
-- Copyright (c) 2004-2014. All rights reserved.
-- Based on (c) 2001 Apple, Inc.

tell application id "DNtp"
    try
        set this_selection to the selection
        if this_selection is {} then error "Please select some contents."

        repeat with this_item in this_selection

            -- Split the name into its leading date (text before the first space) and the title
            set current_name to the name of this_item
            set mydate to texts 1 thru ((offset of " " in current_name) - 1) of current_name
            set myname to texts 11 thru -5 of current_name -- title with its leading space, minus ".pdf"

            -- Build a real date from the YYYY.MM.DD text
            set newdate to the current date
            set the day of newdate to 1 -- avoids month overflow when run on the 29th-31st
            set the year of newdate to (texts 1 thru 4 of mydate) as integer
            set the month of newdate to (texts 6 thru 7 of mydate) as integer
            set the day of newdate to (texts 9 thru 10 of mydate) as integer

            set enddate to newdate + (2 * days) -- OS date of publication
            set newdate to newdate + (13 * days) -- NS date of publication (2 + 11 days)

            -- Format the NS date as YYYY.MM.DD
            tell (newdate)
                set daystamp to day
                set monthstamp to (its month as integer)
                set yearstamp to year
            end tell

            set daystamp to (texts -2 thru -1 of ("0" & daystamp as text))
            set monthstamp to (texts -2 thru -1 of ("0" & monthstamp as text))

            set formatdate to (yearstamp as text) & "." & monthstamp & "." & daystamp

            -- Format the OS date as YYYY.MM.DD
            tell (enddate)
                set daystamp2 to day
                set monthstamp2 to (its month as integer)
                set yearstamp2 to year
            end tell

            set daystamp2 to (texts -2 thru -1 of ("0" & daystamp2 as text))
            set monthstamp2 to (texts -2 thru -1 of ("0" & monthstamp2 as text))

            set formatenddate to (yearstamp2 as text) & "." & monthstamp2 & "." & daystamp2

            -- New name: NS date, title, then the original OS publication date
            set new_item_name to formatdate & myname & " " & formatenddate & " OS"
            set the name of this_item to new_item_name

        end repeat
    on error error_message number error_number
        if the error_number is not -128 then display alert "DEVONthink Pro" message error_message as warning
    end try
end tell

So once I do all those things, I can use a smart group and sort the Spotlight Comment column chronologically to get an accurate sense of the chronological order in which publications discussed events.

This screenshot shows the difference – some of the English newspapers haven’t been converted yet (I’m doing it paper by paper because the papers were often published on different schedules), but here you can see how OS and NS dates were mixed in willy-nilly, say comparing the fixed Flying Boy and Evening Post with the yet-to-be-fixed London Gazette and Daily Courant issues.

[Image: DTPO Newspapers redated.png]

Of course the reality has to be even more complicated (Because It’s History!), since an English newspaper published on January 1, 1702 OS will publish items from continental newspapers, dating those articles in NS – e.g., a 1702.01.01 OS English newspaper will have an article dated 1702.01.05 NS from a Dutch paper. So when I take notes on a newspaper issue, I’ll have to change the leading NS date of the new note to the date on the article byline, so it will sort chronologically where it belongs. But still.

The Flood continues

Anybody else notice the explosion in edited collections over the past X number of years? I assume it has something to do with the publishing market, but I wouldn’t be surprised if changes in academia, namely the recent incentivization of frequent publishing in English higher ed, as well as various EU government funding streams, have encouraged lots of European scholars to host conferences and publish the results. But what do I know.

And by way of segue (note, not Segway), how about some recent publications in an EMEMH vein? How about if I put them in no particular order and provide almost no additional commentary?

Tracy, James D. Balkan Wars: Habsburg Croatia, Ottoman Bosnia, and Venetian Dalmatia, 1499–1617. Lanham: Rowman & Littlefield Publishers, 2016.

Davies, Brian L. The Russo-Turkish War, 1768-1774: Catherine II and the Ottoman Empire. London: Bloomsbury Academic, 2016.
Brittan, Owen. “Subjective Experience and Military Masculinity at the Beginning of the Long Eighteenth Century, 1688-1714.” Journal for Eighteenth-Century Studies 40, no. 2 (June 1, 2017): 273–90.
El Hage, Fadi. Vendôme : La gloire ou l’imposture. Paris: BELIN, 2016.
Close, Christopher W. “City-States, Princely States, and Warfare: Corporate Alliance and State Formation in the Holy Roman Empire (1540–1610).” European History Quarterly 47, no. 2 (April 1, 2017): 205–28.
Black, Jeremy. Plotting Power: Strategy in the Eighteenth Century. Bloomington: Indiana University Press, 2017.
Murdoch, Steve, Alexia Nora Lina Grosjean, and Siobhan Marie Talbott. “Drummer Major James Spens: Letters from a Common Soldier Abroad, 1617-1632.” Northern Studies 47 (December 2015): 76–101.
McCluskey, Phil. “ ‘Enemies of Their Patrie’: Savoyard Identity and the Dilemmas of War, 1690-1713.” In Performances of Peace: Utrecht 1713, 69–91. Leiden: Brill, 2015.
Probably the most military-themed of the dozen chapters, based off a conference of the same name.
Berkovich, Ilya. Motivation in War: The Experience of Common Soldiers in Old-Regime Europe. Cambridge: Cambridge University Press, 2017.
James, Alan. “Rethinking the Peace of Westphalia: Toward a Theory of Early-Modern Warfare.” In Aspects of Violence in Renaissance Europe, edited by Jonathan Davies. Ashgate Publishing, 2013.
Woodcock, Matthew. “Tudor Soldier-Authors and the Art of Military Autobiography.” In Representing War and Violence, 1250-1600, edited by Joanna Bellis and Laura Slater. Boydell Press, 2016.
Several other chapters in the collection deal with medieval warfare also.
Steen, Jasper Van der. Memory Wars in the Low Countries, 1566-1700. Leiden; Boston: Brill Academic Publishers, 2015.
Fulton, Robert. “Crafting a Site of State Information Management: The French Case of the Dépôt de La Guerre.” French Historical Studies 40, no. 2 (April 1, 2017): 215–40.
Manning, Roger. War and Peace in the Western Political Imagination: From Classical Antiquity to the Age of Reason. Bloomsbury Publishing, 2016.
Abel, Jonathan. Guibert: Father of Napoleon’s Grande Armée. University of Oklahoma Press, 2016.
Van der Linden, David. “Memorializing the Wars of Religion in Early Seventeenth-Century French Picture Galleries.” Renaissance Quarterly 70, no. 1 (2017): 132–78.
Asbach, Olaf, and Peter Schröder, eds. The Ashgate Research Companion to the Thirty Years’ War. Farnham, Surrey, England ; Burlington, VT: Ashgate Publishing, 2014.

Blakemore, Richard J., and Elaine Murphy. The British Civil Wars at Sea, 1638-1653. Boydell Press, 2017.

Linnarsson, Magnus. “Unfaithful and Expensive – but Absolutely Necessary: Perceptions of Mercenaries in Swedish War Policy, 1621–1636.” Revue d’Histoire nordique 18 (2015): 51–73.
Tolley, Stewart. “In Praise of General Stanhope: Reputation, Public Opinion and the Battle of Almenar, 1710-1733.” British Journal for Military History 3 (2017): 1–21.
Vo-Ha, Paul. Rendre les armes – Le sort des vaincus XVI-XVIIe siècles. Champ Vallon, 2017.
Forssberg, Anna Maria. “The Information State: War and Communication in Sweden during the 17th Century.” In (Re-)Contextualizing Literary and Cultural History, n.d.
Murphy, Neil. “Violence, Colonization and Henry VIII’s Conquest of France, 1544–1546.” Past & Present 233, no. 1 (November 1, 2016): 13–51.
Langley, Chris R. “Caring for Soldiers, Veterans and Families in Scotland, 1638–1651.” History 102, no. 349 (January 1, 2017): 5–23.
Ede-Borrett, Stephen. The Army of James II, 1685-1688: The Birth of the British Army. Helion and Company, 2017.
Sherer, Idan. Warriors for a Living: The Experience of the Spanish Infantry during the Italian Wars, 1494-1559. Brill Academic Publishers, 2017.
Houston, Amy. “The Faithful City Defended and Delivered: Cultural Narratives of Siege Warfare in France, 1553-1591.” Archiv Für Reformationsgeschichte/Archive for Reformation History 107, no. 1 (October 2016).
Paton, Kevin, and Martin Cook. “The 1560 Fortifications and Siege of Leith: Archaeological Evidence for a New Transcription of the Cartographic Evidence.” Post-Medieval Archaeology 50, no. 2 (May 3, 2016): 264–78.
And then we come to the editorial commentary.
Jacob, Frank, and Gilmar Visoni-Alonzo. The Military Revolution in Early Modern Europe: A Revision. London: Palgrave Pivot, 2016.
Sounds intriguing yes? I thought so too. So I bought it – $55 for hardcover isn’t too bad, I thought to myself. But what I failed to do, unfortunately, is to look closely at the page length. To save you the trouble, here’s a comparison of a few “randomly-chosen” books:
[Image: Pivot Photo.jpeg]
Yep, I just spent $55 plus tax for a measly 101 pages (88 of actual text). The importance of an imprint.
For comparison, feel free to reread my earlier thoughts on EMEMH publishing, which seemed to be going in the opposite direction of costlier and deeper: here and here. It may just be me, but I’m not sure I like the direction of this Pivot.
We could apply the Ostwald Test: Historiography for Dummies, but I’m not sure what the pages-to-coverage ratio would be for a book that ranges from the Classical world to World War II, from Tenochtitlan to Mysore to Korea, and from Alexander the Great to Leopold III of Austria to Koxinga. All in 101 pages. Onnekink’s Reinterpreting the Dutch Forty Years War is a bit longer and more focused ($55 for 138 pages), but it’s the principle of the thing: I’d rather spend $100 for a 300-page book that delves into a subject I’m interested in.
Caveat emptor, man. Caveat emptor.
Addition: Forgot to mention that, on the Palgrave Pivot front, they are obviously trying to blur the distinction between book and article. Or maybe they’re just conceding that most people photocopy/scan individual chapters. Why might I think that? Hmm:
[Image: Onnekink Reinterpreting the Dutch Forty Years War 1672-1713 ch1 p1.png]
It will be interesting to see if other publishers take up this model.

There’s gotta be a better way

In preparation for a new introductory digital history course that I’ll be teaching in the fall, I’ve been trying to think about how to share my decades of accumulated computer wisdom with my students (says the wise sage, stroking his long white beard). Since my personal experience with computers goes back to the 80s – actually, the late 70s with Oregon Trail on dial-up in the school library – I’m more of a Web 1.0 guy. Other than blogs, I pretty much ignore social media like Facebook and Twitter (not to mention Snapchat, Instagram, Pinterest…), and try to do most of my computer work on a screen larger than 4″. So I guess that makes me a kind of cyber-troglodyte in 2017. But I think that does allow me a much broader perspective of what computers can and can’t do. One thing I have learned to appreciate, for example, is how many incremental workflow improvements are readily available, shortcuts that don’t require writing Python from the command line.

As a result, I’ll probably start the course with an overview of the variety of ways computers can help us complete our tasks more quickly and easily, which requires understanding the variety of ways in which we can achieve these efficiencies. After a few minutes of thought (and approval from my “full-stack” computer-programming wife), I came up with this spectrum that suggests the ways in which we can make computers do more of our work for us. Toil, silicon slave, toil!

[Image: Computer automation spectrum.png]

Automation Spectrum: It’s Only a Model

Undoubtedly others have already expressed this basic idea, but most of the digital humanities/digital history I’ve seen online is much more focused on the extreme right of this spectrum (e.g. the quite useful but slightly intimidating Programming Historian) – this makes sense if you’re trying to distantly read big data across thousands of documents. But I’m not interested in the debate whether ‘real’ digital humanists need to program or not, and in any case I’m focused on undergraduate History majors who often have limited computer skills (mobile apps are just too easy). Therefore I’m happy if I can remind students that there is a large variety of powerful automation features available to people with just a little bit of computer smarts and an Internet connection, things that don’t require learning to speak JavaScript or Python fluently. Call it kaizen if you want. The middle of the automation spectrum, in other words.

So I’ll want my students, for example, to think about low-hanging fruit (efficiency fruit?) that they can spend five minutes googling and save themselves hours of mindless labor. As an example, I’m embarrassed to admit that it was only when sketching this spectrum that I realized that I should try to automate one of the most annoying features of my current note-taking system, the need to clean up hundreds of PDFs downloaded from various databases: Google Books, Gale’s newspaper and book databases, etc. If you spend any time downloading early modern primary sources (or scan secondary sources), you know that the standard file format continues to be Adobe Acrobat PDFs. And if you’ve seen the quality of early modern OCR’d text, you know why having the original page images is a good idea.

But you may want, for example, to delete pages from PDFs that include various copyright text – that text will confuse DTPO’s AI and your searches. I’m sure there are more sophisticated ways of doing that, but the spectrum above should prompt you to wonder whether Adobe Acrobat has some kind of script or macro feature that might speed up deleting such pages from 1,000s (literally) of PDF documents that you’ve downloaded over the years. And, lo and behold, Adobe Acrobat does indeed have an automation feature that allows you to carry out the same PDF manipulation again and again. Once you realize “there’s gotta be a better way!”, you only need to figure out what that feature is called in the application in question. For Adobe Acrobat it used to be called batch processing, but in Adobe Acrobat Pro DC such mass manipulations now fall under the Actions moniker. So google ‘Adobe Acrobat Actions’ and you’ll quickly find websites that allow you to download various actions people have created. Which allows you to quickly learn how the feature works, and to modify existing actions. For example, I made this Acrobat Action to add “ps” (primary source) to the Keywords metadata field of every PDF file in the designated folder:

[Image: Screenshot 2017-05-10 18.52.17.png]

I already copied and tweaked macros and AppleScripts that will add Keywords to rich text files in my Devonthink database, but this Adobe solution is ideal after I’ve downloaded hundreds of PDFs from, say, a newspaper database.

Similarly, this next action will delete the last page of every PDF in the designated folder. (I just hardcoded it to delete page 4, because I know newspaper X always has 4 pages – I can sort by file size to locate any outliers – and the last page is always the copyright page with the nasty text I want to delete. I can, for example, change the exact page number for each newspaper series, though there’s probably a way to make this a variable that the user can specify with each use):

[Image: Screenshot 2017-05-10 18.52.43.png]

Computers usually have multiple ways to do any specific task. For us non-programmers, the internet is full of communities of nerds who explain how to automate all sorts of software tasks – forums (fora?) are truly a godsend. But it first requires us to expect more from our computers and our software. For any given software, RTFM (as they say), and then check out the software’s website forum – you’ll be amazed at the stuff you find. Hopefully all that time you save from automation won’t be spent obsessively reading the forum!

End of the semester

Which means I can return to the blog. Why so long without a post? The usual suspects: teaching three courses (note-to-self: teaching a course requiring three new class preps per week for an entire semester gets really old, even if it’s the Enlightenment); revising a think-piece book chapter on what we mean when we use the term “strategy”; revising my chapter on siege capitulations and otherwise editing the other chapters in the World of the Siege collection; thinking about the battle book; assistant chairing and scheduling; designing and overseeing the creation of a Digital History Lab; splitting my Devonthink databases into separate course databases and setting up my Devonthink To Go databases on the iPad/iPhone; downloading a ton of Google Books PDFs; and starting preps for a new Intro to Digital History course this fall.

But motivated by all the digital tips and tricks I’m learning, I’ll try to make more frequent posts for the blog over the summer. That will include posting a few examples of the new digital toys.

So stay tuned…