Seriously though. I’ve known about the concept of ‘regular expressions’ for years, but for some reason I never took the plunge. And now that I have, my mind is absolutely blown away. Remember all those months in grad school (c. 1998-2000) when I was OCRing, proofing and manually parsing thousands of letters into my Access database? Well I sure do.
Twenty years later, I now discover that I could’ve shaved literally months off that work, if only I’d adopted the regex way of manipulating text. I’ll blame it on the fact that “digital humanities” wasn’t even a thing back then – check out Google Ngram Viewer if you don’t believe me.
So let’s start at the beginning. Entry-level text editing is easy enough: you undoubtedly learned long ago that in a text program like Microsoft Word you can find all the dates in a document – say 3/15/1702 and 3/7/1703 and 7/3/1704 – using a wildcard search like 170^#, where ^# is the wildcard for any digit (number). That kind of search will return 1701 and 1702 and 1703… But you’ve also undoubtedly been annoyed when you next learn that you can’t actually modify all those dates, because a basic find-replace swaps whatever the wildcard matched for the same fixed text every time. So, for example, you could easily convert all the forward slashes into periods, because you simply replace every slash with a period. But you can’t turn a variety of dates (text strings, mind you, not actual date data types) from MM/DD/YYYY into YYYY.MM.DD, because you need wildcards to find all the digit variations (3/15/1702, 6/7/1703…), but you can’t keep those values found by wildcards when you try to move them into a different order. In the above example, trying to replace 170^# with 1704 will convert every year to 1704, even if it’s 1701 or 1702. So you can cycle through each year and each month, like I did, but that takes a fair amount of time as the number of texts grows. This inability to do smart find-replace is a cryin’ shame, and I’ve gnashed many a tooth over this quandary.
Enter regular expressions, aka regex or grep. I won’t bore you with the basics of regex (there’s a website or two on that), but will simply describe it as a way to search for patterns in text, not just specific characters. Not only can you find patterns in text, but with features called capture groups and back references, plus look-aheads/look-behinds (collectively: “lookarounds”), you can retain whatever the wildcards matched and manipulate the entire text string without losing those characters. It’s actually pretty easy:
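For illustration, here is a minimal sketch in Python of the date swap described above. The sample sentence, the pattern and the helper name `reorder_date` are assumptions made up for this example, not the actual expressions used on the letters; it only shows how capture groups let you keep what the "wildcards" found and put it back in a new order.

```python
import re

# Hypothetical sample text; the dates are the ones mentioned above.
text = "Letters dated 3/15/1702, 3/7/1703 and 7/3/1704."

# Three capture groups: month (1-2 digits), day (1-2 digits), year (4 digits).
DATE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")

def reorder_date(match):
    """Rebuild one matched date as YYYY.MM.DD, zero-padding month and day."""
    month, day, year = match.groups()
    return f"{year}.{int(month):02d}.{int(day):02d}"

# Each match is handed to reorder_date, so every date keeps its own digits.
converted = DATE.sub(reorder_date, text)
print(converted)

# The same idea with plain back references (\1, \2, \3) and no zero-padding:
simple = DATE.sub(r"\3.\1.\2", text)
print(simple)
```

The first version uses a function so the month and day can be padded to two digits; the second is closer to what you would type into the find-replace box of a regex-aware text editor, where the replacement simply refers back to the captured groups by number.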
Yep, it’s been a computational summer. Composed mostly of reading up on all things digital humanities. (Battle book? What battle book?) Most concretely, that’s meant setting up a modest Digital History Lab for our department (six computers, book-microfilm-photo scanners, a Microsoft Surface Hub touch display, and various software), and preparing for a brand new Intro to Digital History course, slated to kick off in a few weeks.
I’ve always been computer-curious, but it wasn’t until this summer that I fully committed to my inner nerdiness, and dove into the recent shenanigans of “digital humanities.” Primarily this meant finally committing to GIS, followed by lots of textual analysis tools, and brushing up on my database skills. But I’ve even started learning Python and a bit more AppleScript, if you can believe it.
So, in future posts, I’ll talk a little less about Devonthink and a bit more about other tools that will allow me to explore early modern European military history in a whole new way.
Our household has been in a bit of a spring cleaning vibe (new bookcases will do that), which inspired me to get rid of a bunch of old electronics dating from the Pleistocene. In addition to recycling some pocket electronics (an old digital recorder and an old Dell Digital Jukebox MP3 player – and where oh where did my old c. 2004 Dell Axim go?), we also are unloading one very old (486?) PC and a bevy of laptops, which made me briefly reminisce on all the laptops I’ve loved, and hated, before (sung with a Willie Nelson twang).
Historical research, as most of us know, has traditionally been a solitary practice. Even in this postmodern age of killa’ collabs and remixes with co-authors named feat., historians, by and large, are still a lonely bunch of recluses. Admittedly, one’s choice of subject has a lot to do with how crowded your subfield is. Unfortunately (or not?), I’ve rarely been in a position where I knew somebody else who was actively researching the same war as me (War of the Spanish Succession) and might want to look at the same sources. John Stapleton is the closest example from my grad school days, and he focuses on the war before “mine,” so we’ve given each other feedback and pointed each other to various sources from “our” respective wars over the years. In general, though, it’s been kinda lonely out here on the plains.
But the times they are a-changin’ and the prairie turf is being transformed into suburban subdivisions. The question is whether all these houses will follow a similar aesthetic, whether their architecture will reference each other, or whether the only communication between neighbors will consist of vague nods at the grocery store and heated arguments over how far their property line extends. (Thus far, subdivisions are still segregated into ethnic neighborhoods.)
If we look beyond the discipline of History, we’re told that it’s an age of collaboration (CEOs say they want their new employees to work effectively in teams) as well as the age of information overload (I believe that – my main Devonthink database has grown to 104,000 documents and 95 million words of text). Even the other kind of doctors are having a rethink. Now this whole Internet thing allows like-minded individuals to communicate and commiserate across the planet, and not just with their neighbor next door. “Global village” and all that. As a result, even historians have figured out that we can now find out if we’re alone in the universe or not – I assume everybody has Google Alerts set for their name and publication titles? This academic version of Google Street View has certainly expanded my worldview. My one semi-regret is that, thanks to online dissertations, conference proceedings and even blogs, I now find out I was in the archives 10-15 years too early, and there are currently a bunch of people both American and Euro looking into the period – and by “bunch” I mean maybe 6-12. Even more reasons for making connections. Hmmm, someone should create a blog that allows EMEMH scholars to communicate with each other…
So how should historical research work in this interconnected digital age, in this global, digital village? In an age when the moderately-well-heeled scholar can accumulate scans of thousands of rare books and hundreds of archival volumes? The combination of collaboration and digitization has opened up a spectrum of possibilities, and it’s up to us to decide which are worth exploring. Here are some possibilities I see, stretching along a spectrum from sharing general ideas to swapping concrete primary sources (Roy Rosenzweig undoubtedly predicted all this twenty years ago):
- Topic Sharing. The way it’s traditionally been done, in grad school, or if people meet up in the archives or at a conference or on fellowship. You let people know the specific topics you’re working on, and let it progress from there: “Oh, you’re working on X. Do you know about …? Have you checked out Y? You should really look at Z.” This has two advantages: first, it allows participants to keep the details of their research close to the vest, and more fruitfully, it allows the historiography to develop into a conversation rather than separate ships passing each other in the night – it’s such a waste when something gets published that really should have looked at X, Y or Z, but nobody suggested it. Or, perhaps peers studying the same period/place offered comment, but other potential-peers studying the same theme didn’t (or vice versa). Sharing subjects also forces people to acknowledge that they might not be the only person writing on topic X, and encourages them to consider whether they might want to divvy up topics rather than writing in ignorance of what others will be publishing, or already have written. Say, hypothetically, when one thinks they want to write a chapter about how the French viewed battle in the War of the Spanish Succession, and then discovers that another scholar has already written about a thousand pages on the subject. So letting others know what you’re working on would be a start: type of history, subject (sieges? battles? operations? logistics?…), type of study (campaign narrative? commander biography? comparison of two different theaters?…), sides/countries (including languages of sources being used), and so on.
- Feedback and advice. This requires longer and more sustained interaction, but is far more useful for all involved. I’m not convinced by the latest bestseller claiming that the crowd is always right, but crowdsourcing certainly gives a scholar a sense of how his/her ideas are being received, and what ideas a potential audience might like to read about in the first place.
- Research assistance. Here, I would suggest, is where most historians are still living in the stone age, or more accurately, are on the cusp between the paper and digital ages. Most of our precious historical documents survive entombed within a single sheet (or sheaf) of paper, in an archive that may require significant costs and time to access. Depending on a government’s view of cultural patrimony and the opportunity for a marketable product, a subset of those documents has been transferred to the digital realm. But not many. This is where many historians need help, a topic which we’ve discussed many times before (as with this thread, which prompted the present post), and where collaboration and digitization offer potential solutions to the inaccessibility of so many primary sources.
But there is a rather important catch: copyright. Archives and libraries (and publishers, of course) claim copyright over the documents under their care, and they frown upon the idea that information just wants to be free (ask Aaron Swartz):
So this puts a bit of a kink in attempts to create a Napster-style primary source swap meet – though I am getting a little excited just imagining a primary-source orgy like Napster was back in the day.
Fortunately there are steps short of scofflawery. Most of these revolve around the idea of improving the ‘finding aids’ historians use to target particular documents within the millions of possibilities. These range in scale from helping others plan a strategic bombing campaign, to serving as forward observer for a surgical strike:
- A wish list of specific volumes/documents that somebody would like to look at. This could be as simple as having somebody who has the document(s) just check to see what it discusses, whether it’s worth consulting. This, of course, requires a bit more time and effort than simply sharing the PDF.
- Or it might mean providing some metadata on the documents in a given volume. For example, I discovered in the archives that if the Blenheim Papers catalog says that Salisch’s letters to Marlborough in volume XYZ cover the period 1702-1711, and I’m studying the siege of Douai in 1710, it is a waste of one of my limited daily requests to discover that Salisch’s letters include one dated 1702, one from 1711, and the rest all from 1708. The ability to pinpoint specific documents would in itself be a boon: many archives have indexes and catalogs and inventories that give almost no idea of the individual documents. Not only would it save time, but it might also save money if you want to order copies of just a few documents rather than an entire volume.
- Or, such assistance could be as involved as transcribing the meaty bits of a document. Useful for full text, though purists might harbor a lingering doubt about the fidelity of the transcription.
- Or, it might mean running queries for others based off of your own database. I did that for a fellow scholar once, and if you’ve got something like Devonthink (or at least lots of full-text sources), it’s pretty easy and painless. Though if there are too many results, that starts to look a bit like doing someone else’s research for them.
Of course with all of these options, you have to worry about thunder being stolen, about trusting someone else to find what you are looking for, etc., etc. And there probably isn’t a good way to assuage that concern except through trust that develops over time. And trust is based on a sense of fairness: Andy’s questions about how to create a system of calculating non-monetary exchanges have bedeviled barter systems for a long time, I think.
As usual, I don’t have a clear answer. Simple sharing of documents is undoubtedly the easiest solution (cheapest, quickest, fewest number of eyes between the original source and your interpretation), but I don’t have a system for the mechanics. Nor am I clear on the ethical issues of massive sharing of sources – is “My thanks to X for this source” in a footnote enough? If some documents are acquired with grant funds, can they be freely given away? And the list goes on…
New article in Social Science Computer Review using GIS to analyze the 1714 siege of Barcelona.
I also have the number of daily workers, so a casualty rate over the length of the siege could easily be calculated.
And, finally, a colorful map that emphasizes the importance of musketry for the defense:
Now I remember why it took me so long to finish my dissertation – because I wrote 1.5 of them instead of just one.
As academics on a semester system know, Thanksgiving break offers the false hope of a brief interlude before the final dash to the end of the semester. Thus I surfaced for air long enough to waste some time playing around with a few new-ish digital toys that might be of interest to others.
First, for those who use Pocket Informant’s calendar/task-management program, their recent update includes a macro-view (all the cool kids are doing Big Data these days) of your schedule, a heat map indicating how busy your days are over months. As you can tell from the screenshot, I follow the stereotypical academic’s schedule of attempting to keep my summers for my research.
More productively, I decided to waste some more time on mind mapping software. Devonthink is great for storing all my documents and notes, but I still find the need for meta-notes (or organizational cues, or trains of thought) that are extremely hierarchical, and which have to come in a very specific order even if I don’t know where exactly they should go in the overall argument – often these are a series of successive questions that I need to follow up on. You could put them in a group in DT, but that tends to lose the specific train of thought. So instead of pulling out my big sketchpad and writing out a mindmap of my battle book, as I did with my diss, I got a copy of Xmind software. This way I can have my mindmaps everywhere I am, and I can move things from one node to another without having to erase and rewrite. The resulting map for a smaller project (my honor in sieges book chapter) looks like this:
The map is fully searchable, you can add various ‘markers’ and icons, modify the formatting of each point, add images, create floating points (when you’re not yet sure where exactly they should go), and it automatically makes an outline that you can export (upper right in screenshot). I find it useful to see the big picture on a single page (scrolling and zooming in and out as necessary), and to quickly see the ‘shape’ of the argument and the relative amount of detail in each section, rather than flip between a dozen pages of outline and try to imagine how a subpoint would fit in a different spot.
Finally, my frequent reliance on timelines in my courses led me to take the plunge and explore timeline software. My über-efficient timecharts have their uses, but I don’t want to put that amount of effort into all sorts of chronologies in the dozen different courses I teach. Sorry, but the 20th century isn’t worth that much effort. And for my own research purposes, the more info in a given timeline, the greater the need to have the info quickly searchable.
Enter Aeon Timeline. Items are generally divided between Entities (people, institutions, technologies…) and time-defined Events. You can use different levels of precision for different Events, and you can place Events on various arcs, e.g. an operational timeline might include separate arcs for each theater of operations. Befitting the digital data, all entries and metadata are searchable, and the timelines are zoomable in both directions. You can add notes to each Entity and Event, and there are a few limited formatting options (with possibly more to come in future versions). So in the operational arcs I indicate the Allied sieges with a red font and the Bourbon sieges with a blue font; in the English politics arcs I use buff to indicate the Whigs and blue to indicate the Tories. You can import images, for example people’s portraits or even simplified maps of battles and sieges. You can also filter your results to show only a subset of the events and entities, based on the metadata. And you can import massive quantities of data in CSV or tab-delimited format, rather than use the individual event creation dialog box.
Further, you can define a Relationship between each Entity and each Event – e.g. an Entity might have one Event that was its birth, another its death, while another Event of that Entity (say, a person) might be that individual’s participation in a particular siege. This view is a bit messy in the Event (top) half of the window – you should primarily just look at the bottom half, in the Relationship view, which allows you to see all the events that each entity was involved with – and even how old the given Entity was, if you want. The developer promises to make this view more intuitive in future versions. And, if I were ever to make my own WordPress blog site (i.e. not use wordpress.com), I could export the timelines in simile format and post interactive versions online.
So that’s how I spent my Thanksgiving week, when not eating turkey, that is.
Interesting NY Times story on the increasing use of scribes by physicians – you know, those who claim to be “doctors.”
Three weeks of training gets you a scribe that follows you around with a laptop in hand and takes notes on your interactions with patients, with the scribe company charging $25 per hour ($8-$16 for the scribe). Sounds like something academics could use: there don’t seem to be nearly enough research assistants floating around. Only problem: that going rate is a bit high.
Apparently all the computerization is one of the biggest complaints among physicians. A money quote from the article: “a recent article in the journal Health Affairs concluded that two-thirds of a primary care physician’s day was spent on clerical work that could be done by someone else; among the recommended solutions was the hiring of scribes.”
From one doctor to another, I hear ya. Though History must be more challenging, because I’ve had limited success getting some of my department’s past office workers to do much more than photocopy.
Computerized medical records were supposed to make everything efficient, but I guess they forgot the lowly data-entry clerk. I didn’t. So now we’re going back to the days when secretaries actually did typing for doctors, at least the medical kind. Funny how technology sometimes takes you in circles.