
What EXACTLY Python does

Given my impassioned proselytization of digital history, it’s not surprising that I received an email from a colleague asking a reasonable, yet hard-to-answer question: “what EXACTLY does Python do?” His name’s not important, but let’s call him “Bob.”

So if you’ve bothered reading any of my recent posts, and just wish I’d start from the beginning (or shut up, already), here’s my take on Python and computer programming, geared towards computer-knowledgeable people who haven’t programmed before. This is the perspective of someone who’s been a digitally-savvy humanities-type for thirty years, but has only recently dived into learning a computer programming language from scratch. Someone, keep in mind, who’s never gotten the official version from a computer science course, but who does teach a digital history course. The focus is therefore on all the small and medium-sized things you can have your computer do for individual historical research – the digital historian’s low-hanging fruit.

What EXACTLY does a computer program do?

Python is a “high-level” general-purpose programming language. Which means, it uses a syntax that is ‘relatively’ readable, unlike the machine code used by the Matrix, and it can do ALL sorts of things with just about any kind of data you can imagine. But, unlike the Matrix, you can’t really bend or break the rules. At least I can’t.

[Image: Matrix-style binary code on a green background]

NOT Python code

So asking what ‘exactly’ a programming language like Python does doesn’t emphasize enough the fact that it does just about anything you could want, ASSUMING you have some digital information, AND assuming you want to do something with/to that data that can be translated into a series of discrete steps (an algorithm). In other words, there’s a lot of gold in them thar’ hills. But it’s only worth prospecting if you have (or can get) a lot of digital data, and need to repeat the same kinds of manipulations and analysis over and over: for each character/word/paragraph/document, I want to do X, Y, Z; if it says ‘blah’ I want to… (I just introduced you to for-loops and if-statements, in case you didn’t notice.) As long as the above two assumptions are met (nice to meet you, Boolean value = True), you’re golden.
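For the curious, here is a minimal sketch of that "for each word… if it says ‘blah’…" pattern. The word list is invented for illustration, standing in for a real document:

```python
# Hypothetical sample data: a handful of words standing in for a real document.
words = ["the", "siege", "of", "Lille", "was", "a", "long", "siege"]

siege_count = 0
for word in words:          # "for each word, I want to do X..."
    if word == "siege":     # "...if it says 'blah', I want to..."
        siege_count += 1    # here, X is simply "add one to a running tally"

print(siege_count)
```

Swap in a different list, or a different test after the `if`, and the same few lines will tally something else entirely.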

Imagine that you were asked “what exactly does Microsoft Excel do?” Instrumentally, you’d probably reply that it allows you to type in numbers, edit the numbers with features like copy and paste, format the numbers in a variety of ways, perform mathematical calculations on the numbers, and print out or email the results. Or, you could answer the question by giving specific examples of projects you could do: Excel allows me to balance my budget; it lets me keep track of student grades, or my fantasy league stats. Good answers all, but just the tip of the iceberg. You could create an entire Excel spreadsheet that doesn’t have a single number in it. You could use Excel to lay out a particularly complicated table that Word’s Table feature chokes on, or that Word generally makes a pain in the arse to edit. You could, like I did twenty years ago, use Excel as a go-between when transferring Word notes into a note-taking Access database. (See! that Python post does have some utility after all.) You could even use Excel to rename a whole folder full of files based off another list, e.g. rename all those EEBO-TCP files you downloaded, which have names like A10352.txt. In a sense, that literally has almost nothing to do with spreadsheets – you delete the Excel file when you’re done, because it was just a means to an end. In other words, what “exactly” Excel does is limited most broadly by the features built into the application, but, more practically, it depends on the vision and technical expertise of the person using it.

Same with a computer programming language. But the Python canvas starts out with nothing at all on it, not even a grid of empty cells. Intimidating, but its tabula rasa lets Python deal with all sorts of different types of information, and allows you to manipulate them in an almost infinite number of ways, because you can keep building layer upon layer of functions on top of each other, drawing in data and inputs from all sorts of places, and sending them wherever you want. So, if you can think of a specific series of commands you invoke using some packaged software like Word or Excel or your web browser, you can (probably) recreate that workflow in Python. But it’s not just about replication; I see three main advantages to coding over a packaged piece of software:

  1. Coding allows you to totally control the inputs, the manipulations, and the outputs. Don’t like the way a program formats the results? Change it. Wish the summary feature would also include your favorite statistic? Include it. Hate how the program always does this one annoying thing? Fix it.
  2. Performing the task will be automated with Python, so you could run it thousands of times, in a few seconds (or more, depending on the complexity of the task), without getting carpal tunnel syndrome. Some packaged programs allow you to install or use macros to ‘record’ these repeated actions, and programming is like that, but so much more, because you have so much more control, and a much larger toolbox to draw upon. The result is a quantitative increase in speed, but that quantitative increase is really a qualitative advance – you can do things that you’d never do otherwise, because the computer does most of it for you.
  3. You don’t need to purchase, learn and maintain the dozen different programs that would be needed to do (most of) the data analysis you can perform with Python. Nor do you need to worry about your specialized programs disappearing into history, as happens shockingly often unless your name is Microsoft or Apple or Adobe. Nor do you need to worry about radical changes from version to version. If you’ve experienced several generations of Microsoft Office or Blackboard or, heaven help you, mobile apps, you know what I mean. Python will definitely require you to keep an eye on changing Python library versions, and possible incompatibilities that might result. But at least with Python, you can create virtual environments so that you can run different versions of the same libraries on the same machine. And there are even web environments that allow you to set all of that up for the user.

Python, in other words, allows you to create workflows that automate a lot of the processing and analysis steps you’d have to do manually, possibly in many different programs. Input data, manipulate it in any number of ways, and then output it.

But it definitely comes with a learning curve, I won’t lie. A learning curve that is currently a lot harder for humanists, in my humble opinion, because we are not the primary users, and therefore we don’t have many pre-made, history-specific libraries with functions already designed. As a result, I’ve spent most of my time learning the fundamentals; the eye candy of maps and charts is much better served by dedicated libraries that require less programming knowledge and more domain expertise. My experience over the past several months has reinforced three important programming principles that I’m sure Programming 101 courses emphasize:

  1. Think through the nature of your data at each point in your code.
  2. Think through each logical step in your algorithm.
  3. Pay constant attention to the numerous syntax rules.

Ignoring any of the above will lead to errors. Fortunately, you can google the resulting error messages and probably figure out the problem, since it will likely be a violation of one of the above principles. Are you trying to use string methods on a list item? Are you expecting your iterator to be an iterable? Do you really want to put the print function inside that loop? Are you blindly copying an online example without considering how your case is different? And, as I keep telling my students, attention to detail matters, whether it’s a missing close quote or a mismatched variable name. But the computer will usually tell you if you make a mistake, you learn these things over time (and build up working code that you can copy or use as functions), and practice makes perfect.
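To make the first of those questions concrete, here is a small sketch of what happens when you forget that your data has changed from a string into a list. The sentence is just a made-up sample, and the `try`/`except` is only there so the example keeps running after the error:

```python
line = "Marlborough besieged Lille in 1708"   # sample text, a single string
tokens = line.split()                         # tokens is now a list of strings

try:
    tokens.upper()            # lists have no .upper() method, so this fails
except AttributeError as err:
    message = str(err)
    print("Python complains:", message)

# The fix: apply the string method to each item in the list instead.
shouted = [token.upper() for token in tokens]
print(shouted)
```

The error message names both the data type and the method you tried to use, which is usually enough to google your way to the fix.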

What EXACTLY does Python do for Scholars?

For scholars, a programming language like Python can help manipulate and analyze whatever text, numbers, image, video and sound information we have in digital form. Fortunately, it’s not as minimalist as all that, thanks to a few thousand free libraries (or modules) that you import into the basic Python package. The libraries do most of the hard work behind the scenes, i.e. most of the coding has already been done by the libraries’ authors, so you just plug and play with some standard commands, your specific data, the specific variables you create, and the specific commands combined in the specific order you choose. Figure out how to get your information into the code (i.e. the computer’s) memory in a specific format based on its structure (is it a string? a list? a dictionary?…), manipulate the resulting data (maybe you convert the string to a list, after replacing certain features and adding others), pass it on to the next block of code that does something else (now that it’s a list, you can loop through each item and count those that start with the letter ‘q’), do more things to the list and the count (maybe if the count exceeds a certain threshold, you send that word to another list), then pass any of those results on to the next block of code, until you end up with what you want. To give a simple real-world example: maybe you’ve started with a long string of text (like this paragraph here), you tokenize it into a list of individual words (deciding how you want to deal with punctuation and contractions), then count the words according to the letter they start with, then plot a histogram showing the frequency of each letter.
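The simple example in that last sentence might be sketched roughly like this, using only Python’s built-in libraries. The sample sentence is invented, the tokenizing rule (lowercase everything, keep apostrophes inside words) is just one of many possible choices, and the "histogram" is a crude text version rather than a proper chart from a plotting library:

```python
import re
from collections import Counter

# Hypothetical sample text; in practice this would be read in from a file.
text = "Quick questions: did the queen's quartermaster quit? The siege didn't."

# Tokenize: lowercase the string, then keep runs of letters, allowing an
# apostrophe inside a word so that contractions stay in one piece.
tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

# Count the words by the letter they start with.
first_letters = Counter(token[0] for token in tokens)

# Pull out the words starting with 'q' into their own list.
q_words = [token for token in tokens if token.startswith("q")]

# A crude text "histogram": one # per word starting with that letter.
for letter, count in sorted(first_letters.items()):
    print(letter, "#" * count)
```

Each step hands its result to the next: string to list, list to counts, counts to output.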

Python is also attractive to scholars because it’s free – insert joke about poor professor here. Its costlessness and open source ethos have encouraged hundreds of people to create free general libraries focusing on particular types of data and particular types of analysis, along with specialized domain libraries for astronomers, for geographers, for audiologists, for linguists, for stock market analysts… There is also a massive number of tutorials available online. Every year there are a dozen Py conferences held all over the world, and several hundred of the presentations are available on YouTube, including numerous 3-hour tutorials for beginners. You can also check out the Programming Historian website, which has numerous examples in Python. There are numerous cautions with programming (in our case, use Python 3+, not 2.7…), but there are lots of resources that discuss those. Plenty of ways to get started, in other words.

A final benefit of particular importance for humanities-types is Python’s ability to convert words into numbers, usually behind the scenes, and highlight patterns using various statistical properties of text. Such powerful text functions allow businesses to data mine tweets and online content; business demand seems to have juiced computer science research, leading to lots of advanced natural language processing (NLP) features, on top of those driven by the (older) linguistic and literary interests of academics. So, if you have a lot of digitized text or images and you want to clean/analyze them beyond just reading each document one by one, or manually cycling through search results, one… hit… at… a… time…, then Python is worth a look.

What EXACTLY does Python do for this Historian?

So here’s a list of the Python projects I’ve been working on, and those I will be working on in the future. A few are completed, a few have draft code, a few have some ideas sketched out with snippets of code, and a couple are still in the fantasy phase. Many use the standard functions of off-the-shelf libraries, while others require a bit more custom coding. But they all should be viable projects – time will tell.

  • Semi-automate a book index: Find all (okay, maybe most) of the proper nouns in a PDF document, along with which PDF page each occurred on, then combine them together into a back-of-the-book index format. If you don’t want to pay $1000 to have your book professionally indexed, you could use Word’s or Adobe’s indexing feature, which requires you to go through every sentence and identify which terms will need to be indexed. Or, you can get 85% of that with Python’s NLP (natural language processing) libraries, or you can import in a list of people/places/events and it will find those for you. As with all programs, things will get complicated the more edge and corner cases you try to address: Do I need to include a “he” on the next page with the full name on the previous page? Do I just combine together all consecutive pages into 34-39, or do I need to judge the importance of the headword to each page’s discussion? Tough questions, but this code will, at the least, give you a basis from which to tweak. And, judging from recent indexes in books published by highly-reputable academic presses, nobody cares about the ‘art’ of indexing anymore: some are literally headwords with a giant undifferentiated list of several dozen page numbers separated only by commas; some don’t even provide any kinds of topics, only proper nouns. Of course, the index may well die as more works are consumed digitally…
  • Web scraping: Automate downloading content from a website, either the text on pages, entries in a list, images of battle paintings from Wikipedia, or linked files… Maybe automate the SPARQL query on historical battles I posted about a while back. Or download a bunch of letters from a site that puts each letter on its own separate page. Maybe automate scraping publication abstracts from a website based off records in Zotero. (with a library like beautifulsoup)
  • Web form entry: I’d like to create code that would automate copying bib info from Zotero (author, title, date, pages, etc.) and then paste it into our library’s online ILL form in the respective webform fields, which, of course, aren’t in the same order as the Zotero field order. That means a bunch of cutting and pasting for every request.
  • Look up associated information on an entity (person, place, organization) with Linked Open Data, e.g. find the date of birth for person X mentioned in your text document via the web. (rdflib)
  • [Added by request – get it?] APIs: Numerous institutional websites allow you to access their online data more directly through an API, rather than using brute force scraping to harvest information from their pages’ HTML. Those websites have APIs (Application Programming Interfaces) to allow more sophisticated downloading of information. You can search their site for ‘API’ to see if they offer it. (requests)
  • Convert information into data: My two previous posts on AHA department enrollments and parsing long notes illustrate how this can be done with Python. Code that converts information into data is particularly important low-hanging fruit for historians, since we lack a lot of already-digitized datasets – this kind of code allows us to create them with our own data.
  • Clean dirty OCR text: Correct OCR errors, and generally make a document more readable by humans and computers. A good, detailed description is Ryan Cordell’s Q i-jtb the Raven: Taking Dirty OCR Seriously. This requires a lot of hands-on work with the code, which I’ve been doing of late. E.g. find every occurrence of ‘Out-work’ and convert it to ‘outwork’ – so we can count them all the same way. Find every misOCRed ‘Mariborough’ and convert it to ‘Marlborough’ – there are big lists of common errors available to make this search-and-edit process a bit more precise. But since you’ll never guess every word that might be hyphenated due to line endings, finding every hyphenated word (like ‘be- siege’) and converting it back (to ‘besiege’) is easy enough, with regular expressions. You can even create a list of all the changes your code makes (i.e. create a dictionary with each ‘mistake’ and what it was changed to), if you want to audit the process. More difficult are the Questions of Capitalizations (case), especially when our 17-18C Authors liked to capitalize lots of Nouns, yet modern NLP uses Capitalization as one of its Clues for identifying Proper Nouns (Named Entity Recognition). Ideally you’d have this code as a series of functions so that you can run the various corrections across an entire folder of documents, based on what that source needs. Then, you could use more code to check for any other errors or documents that require special handling.
    I’d argue that this is currently the most critical area for digital history, the bottleneck, in fact. So few historians have their own sources in clean full text, yet it’s also a very idiosyncratic thing to program, based on historically-variant word usage and widely-varying source genre vocabulary, as well as the sometimes-random errors derived from OCRing irregularly set type from a few hundred years ago. (It’d be great if OCR accuracy rates were 100%, but that ideal would seem to require having high-quality scans of the originals, which your average scholar does not have, and will likely never acquire, because we don’t actually own the originals). As a result, cleaning historical OCRed text is the area with the least amount of pre-made code available – big projects like EEBO and ECCO paid cheap foreign labor to type theirs by hand. We should also note that lots of social scientists, for example, talk about ‘preprocessing’ text, but by that they mean standardizing the spelling of text that’s been born digital, making everything lowercase, stripping out punctuation, etc. Historians need a lot of pre-preprocessing first because we are dealing with imperfectly OCRed text. And if we want to retain a cleaned text copy of the original, and not just atomize the text string into a list of word tokens, then it’s even more complicated. Suggestions welcome! Ted Underwood has provided some useful ideas in various venues dealing with big data (10,000s of texts), but his cleaning code on GitHub is a bit above my skill level.
  • Quantitative analysis: Load your spreadsheet into the pandas library and run stats, plot charts… (pandas, matplotlib, seaborn, bokeh for interactivity)
  • Make interactive visualizations, create interactive websites with your data, and so on. (bokeh…)
  • Create a visual timeline, drawn from data extracted from a document. Possibly interactive. Haven’t explored this yet, but I will.
  • Mapping: Quickly make maps, including small multiples, in whatever projection, with whichever features you want to display. Look up coordinates (aka geocode) and calculate spatial/topological relationships… It’s good to see that there are several international project teams working on historical gazetteers, and a number of groups have georeferenced some early modern maps as well. (geopandas, cartopy…)
  • Network analysis: create a network ‘graph’ (diagram) of entities and relationships between those entities (nodes and edges), and measure the topological properties of the network, who are the hubs, the spokes, the nodes… (networkx)
  • Relational database interactions with SQLite and MySQL. (sqlite3) I believe you can do the same with graph databases (triples of subject-predicate-object), but I haven’t really looked at those.
  • Zotero: Update and manipulate your Zotero records. I can already read Zotero data into Python and hand it off to other libraries for analysis, as well as go in the other direction, e.g. mass update fields back in Zotero. But I’d like to create code that will take a PDF page (or two) of a book’s table of contents and enter a separate record for each chapter in the book into Zotero, automatically adding in all the other book info. (pyzotero)
  • Dates: Automatically calculate duration between two dates, convert between OS and NS and other historical calendars, look up “last Tuesday’s” date when mentioned in a letter written on July 7, 1700 – you could probably automatically insert that date into the source document if desired. Maybe even do a quick calculation to see how few days you have left on your sabbatical… (calendar, datetime, dateutil, convertdate, dateparser, arrow)
  • Textual analysis: This is a biggie for historians. Create a corpus of texts in your area; create a list of people, places and events to use for extraction; cluster works (or segments) together by the topic they discuss; see how often different authors/texts use particular words/phrases; identify which words tend to be collocated with which other words; keywords-in-context; sentiment analysis; etc. Did I mention fuzzy searching, finding words that are spelled similarly? Or finding words that are used in the same context as a given word? Maybe you want to analyze your own prose: which words/phrasings/grammatical structures do you overuse? (NLTK, spaCy, textacy, gensim, word embeddings…)
  • Bibliometrics and Historiographical Analysis: From a secondary source, extract all the people, publications, places and time periods/dates mentioned, and graph/map them, before comparing them with other authors. Or analyze the sources cited in the bibliography – authors and affiliations, years of publication, languages, etc. The sciences have a lot of this already because they mostly publish journals and they’re in databases like Web of Science. This also ties into network analysis, especially if you want to look at citation networks.
  • Analyze words/phrases from the 16m-book HathiTrust collection. There’s a website for that, but you can also download the data, or subsets at least.
  • Genealogy: parse genealogical data and analyze. Would be interesting for royal lineages, and some work has already been done on that.
  • Sound analysis. Haven’t played with these, but some people are into reconstructing soundscapes and the like.
  • Image classification and analysis: Group together all the portraits of person X, etc. Haven’t played with these, though you have similar classification algorithms in Facebook, etc.
  • Lots of full-fledged programs are also Python-scriptable. E.g. both ArcGIS and QGIS have Python interfaces, which means you can automate many of the boring tasks you need to perform when making more sophisticated maps.
  • Clean up your computer files – batch rename, copy, delete, convert, etc., with much more flexibility than Mac OS X’s Rename function.
  • Automate lots of administrative school work. Create a syllabus class schedule that lists the day of week and date for each meeting during the semester, removing any holidays or other days off for that specific semester. I’ll be department chair next year, and there are lots of stats and reports on enrollment, assessment… that I’d like to automate: collecting the data from databases/surveys and then analyzing them, without me having to manually repeat the entire process every time. A computer science colleague will have a student working on a course scheduler next semester – given a department’s faculty requests, the available timeslots, and a few dozen university and departmental scheduling requirements, come up with a schedule that meets all those criteria, or at least the most important ones. Right now, we have to do this by hand (with Excel, but still), and it’s a real pain.
  • Machine learning/AI: Python is also one of the main languages used for this new burgeoning field. For historians, that might mean classifying documents and topics, but I haven’t looked into it enough to think about how it could be used. Some of the above-mentioned libraries might well be superseded by machine learning libraries in the future, where things like neural nets figure out their own algorithms without rules being specified by the programmer. I think we’re already seeing a little bit of that with NLP.
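To give a flavor of the OCR-cleaning item in the list above, here is a minimal sketch, assuming a tiny hand-made corrections list and one simple regular expression for words split at line endings. The function name, the corrections dictionary and the sample sentence are all invented for illustration; real cleaning code would need far longer lists and more careful rules:

```python
import re

# Hypothetical corrections list; real lists of common OCR errors are far longer.
KNOWN_FIXES = {"Mariborough": "Marlborough", "Out-work": "outwork"}

# A word fragment, a hyphen, some whitespace (e.g. a line break), then a
# lowercase continuation, as in 'be- siege'.
LINE_END_HYPHEN = re.compile(r"\b([A-Za-z]+)-\s+([a-z]+)\b")

def clean_ocr(text):
    """Return (cleaned_text, audit), where audit maps each 'mistake' to its fix."""
    audit = {}

    def rejoin(match):
        # Glue the two halves of the word back together and record the change.
        fixed = match.group(1) + match.group(2)
        audit[match.group(0)] = fixed
        return fixed

    text = LINE_END_HYPHEN.sub(rejoin, text)

    # Then apply the known corrections, recording only those actually found.
    for wrong, right in KNOWN_FIXES.items():
        if wrong in text:
            audit[wrong] = right
            text = text.replace(wrong, right)
    return text, audit

sample = "Mariborough resolved to be- siege the Out-work before winter."
cleaned, audit = clean_ocr(sample)
print(cleaned)
print(audit)
```

Because the cleaning lives in a function, it could be run across every file in a folder, and the audit dictionary lets you check what was changed. Note that this simple pattern will also rejoin words that were legitimately hyphenated before the line break, which is exactly the kind of edge case that makes this work so idiosyncratic.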

And those are just a few of the things you can do with Python! So whatever data-related project you can think of, there’s probably a way to do it in Python. It’s not just automating the things that you find yourself doing on the computer over and over and over and over again. Just as important, what are the research questions that you want to ask, especially those that would require a lot of drudgery like counting and sorting and revising thousands of documents? Any software package that will answer that particular question for you will have its own learning curve, and there probably aren’t many people whom you could hire to do it for you, so you will probably be on your own. Whatever your question, there’s likely a way to combine the various Python tools together in a way that gets you the desired output.
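To show how small some of these projects can be, here is a rough sketch of the syllabus schedule mentioned in the list, using only the built-in datetime library. The function name, the semester dates and the day off are all made up for the example:

```python
from datetime import date, timedelta

def class_meetings(first_day, last_day, weekdays, days_off=()):
    """List every date from first_day to last_day that falls on one of the
    given weekdays (Monday=0 ... Sunday=6) and is not in days_off."""
    meetings = []
    current = first_day
    while current <= last_day:
        if current.weekday() in weekdays and current not in days_off:
            meetings.append(current)
        current += timedelta(days=1)
    return meetings

# Hypothetical mini-semester: a Tuesday/Thursday class with one invented day off.
schedule = class_meetings(
    first_day=date(2018, 1, 16),
    last_day=date(2018, 2, 1),
    weekdays={1, 3},                 # Tuesday and Thursday
    days_off={date(2018, 1, 25)},
)

for meeting in schedule:
    print(meeting.strftime("%a %b %d, %Y"))   # day of week and date
```

Next semester, you change the dates and the days off and run it again.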

But that’s not all. Using code also means:

  1. You can take whatever output and turn it into the input for another bit of code, and so on and so on. It is practically infinitely extensible.
  2. You can rerun your code but change a parameter, to see the difference it makes. ‘What-if’ exploration is super simple, and you can easily change a parameter anywhere in the workflow and continue the rest of your code with the new results.
  3. When you’re all done with your code, you can run it on another data set or text, or a whole folder full. And then you can compare the results.
  4. When you notice an intriguing pattern in one of your sources, you can quickly add another bit of code to explore it. Then you can look for that pattern in your other documents.
  5. You will also have a record of your process and method: which data you used for which analysis, how you cleaned the data, which settings and parameters you used, the order in which you performed your various steps, and so on. I’m guessing that more than a few historians would be unable to repeat, much less explain, how exactly they got the results they did. How faithfully, for example, do we record our computer-based research workflow? Some grant agencies are beginning to require that recipients submit their data and workflow, along with their results. “Replicability” could even come to mean something in History.

This historian’s “killer app” for Python is a program that reads in a (primary or secondary) source from a text file, and then the code provides statistics on the words and phrases used, identifies rare terms that are unusually common in that document (compared to some corpus), extracts all the proper nouns mentioned, provides a statistical overview of their frequency (overall, and by section of book…), looks up information on the people (say, their nationality, age, etc.), then looks up the coordinates of mentioned places and maps them according to some criteria (by person who mentions the place, by where in the text it is mentioned, by what other things are mentioned around that place…). One output of all this could be tables or graphs of the entities in the text, word visualizations and the like. Another output could be automatically-created maps – not just maps of any of the above entities, but small-multiple maps that would locate a variable (say, siege duration) across four different theaters, and then another set of small-multiple maps that would similarly map the same variable by year instead. Might as well have it make a heat map while you’re at it. Several groups have already created web versions of some of these features (voyant-tools among them). But with your own code, you also end up with all these results in the code itself, which can be further analyzed with yet more code. Then, your code runs itself on a bunch of other documents, and includes comparisons between documents – which texts talk more about place X? This works for teaching as well as research. Imagine if you had a class where you assigned a source, had the students analyze it, and then put an interactive visualization of the document up on the screen to explore. This really wouldn’t be that hard – I already have almost all of the bits, and it’s just a question of chaining them all together. 
It will take a while to make sure the objects, logic and syntax are all copacetic, but hopefully it’ll be done in time for classes next fall.

If you’re not sure about diving into Python, I’d suggest you start by getting as many of your sources in digital form as possible. Scan, OCR, type. Then get yourself a decent text editor like Notepad++ or Text Wrangler/BBEdit and start learning regular expressions.

But the more historians we get writing Python code, the more history-specific code we can build off of. So let’s get started.

Where have you been all my life?

Seriously though. I’ve known about the concept of ‘regular expressions’ for years, but for some reason I never took the plunge. And now that I have, my mind is absolutely blown away. Remember all those months in grad school (c. 1998-2000) when I was OCRing, proofing and manually parsing thousands of letters into my Access database? Well I sure do.

Twenty years later, I now discover that I could’ve shaved literally months off that work, if only I’d adopted the regex way of manipulating text. I’ll blame it on the fact that “digital humanities” wasn’t even a thing back then – check out Google Ngram Viewer if you don’t believe me.

So let’s start at the beginning. Entry-level text editing is easy enough: you undoubtedly learned long ago that in a text program like Microsoft Word you can find all the dates in a document – say 3/15/1702 and 3/7/1703 and 7/3/1704 – using a wildcard search like 170^#, where ^# is the wildcard for any digit (number). That kind of search will return 1701 and 1702 and 1703… But you’ve also undoubtedly been annoyed when you next learn that you can’t actually modify all those dates, because a basic find-replace overwrites whatever the wildcard matched with a single fixed character. So, for example, you could easily convert all the forward slashes into periods, because you simply replace every slash with a period. But you can’t turn a variety of dates (text strings, mind you, not actual date data types) from MM/DD/YYYY into YYYY.MM.DD, because you need wildcards to find all the digit variations (3/15/1702, 6/7/1703…), but you can’t keep those values found by wildcards when you try to move them into a different order. In the above example, trying to replace 170^# with 1704 will turn every year into 1704, even if it was 1701 or 1702. So you can cycle through each year and each month, like I did, but that takes a fair amount of time as the number of texts grows. This inability to do smart find-replace is a cryin’ shame, and I’ve gnashed many a tooth over this quandary.

Enter regular expressions, aka regex or grep. I won’t bore you with the basics of regex (there’s a website or two on that), but will simply describe it as a way to search for patterns in text, not just specific characters. Not only can you find patterns in text, but with features called back references and look-aheads/look-behinds (collectively: “lookarounds”), you can retain those wildcard characters and manipulate the entire text string without losing the characters found by the wildcards. It’s actually pretty easy:
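As one illustration, here is a minimal sketch of the MM/DD/YYYY to YYYY.MM.DD conversion described above, written with Python’s built-in re library (text editors with regex support use very similar patterns, though the replacement syntax varies). The sample sentence and the variable names are invented for the example:

```python
import re

# Hypothetical sample text containing dates as plain strings.
text = "Letters dated 3/15/1702, 6/7/1703 and 7/3/1704 survive."

# Three capture groups in parentheses: (month)/(day)/(year).
date_pattern = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")

# \3, \1 and \2 are back references to whatever each group actually matched,
# so the digits found by the "wildcards" are kept and simply reordered.
reordered = date_pattern.sub(r"\3.\1.\2", text)
print(reordered)
# Letters dated 1702.3.15, 1703.6.7 and 1704.7.3 survive.

# A replacement function can go further, e.g. zero-padding month and day.
def pad(match):
    month, day, year = match.groups()
    return f"{year}.{int(month):02d}.{int(day):02d}"

padded = date_pattern.sub(pad, text)
print(padded)
# Letters dated 1702.03.15, 1703.06.07 and 1704.07.03 survive.
```

One pattern and one replacement handle every date in the text, however many there are, with no cycling through each year and month by hand.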

Read More…

The Summer of Digital

Yep, it’s been a computational summer. Composed mostly of reading up on all things digital humanities. (Battle book? What battle book?) Most concretely, that’s meant setting up a modest Digital History Lab for our department (six computers, book-microfilm-photo scanners, a Microsoft Surface Hub touch display, and various software), and preparing for a brand new Intro to Digital History course, slated to kick off in a few weeks.

I’ve always been computer-curious, but it wasn’t until this summer that I fully committed to my inner nerdiness, and dove into the recent shenanigans of “digital humanities.” Primarily this meant finally committing to GIS, followed by lots of textual analysis tools, and brushing up on my database skills. But I’ve even started learning Python and a bit more AppleScript, if you can believe it.

So, in future posts, I’ll talk a little less about Devonthink and a bit more about other tools that will allow me to explore early modern European military history in a whole new way.

Stay tuned…

Voyant-to-web also a success

In case you need proof, here’s a link (collocate) graph from Voyant tools, based off the text from the second volume of the English translation of the “French” Duke of Berwick’s memoirs published in 1779: Curious which words Berwick used most frequently, and which other words they tended to be used with/near? (Or his translator, in any case.) Click the link above and hopefully you’ll see something like this, but interactive:

Screenshot 2017-06-25 14.49.23.png

After you upload your text corpus in the web version of Voyant, you can then export any of the tools and embed it in your own website using an iframe (inline frame). Note that you can also click on any of the terms in the embedded web version and it will open up the full web version of Voyant, with the corpus pre-loaded. Something like this, but oh-so-much-more-interactive:

Screenshot 2017-06-25 14.47.08.png

Apparently the Voyant server keeps a copy of the text you upload – no idea how long the Voyant servers keep the text, but I guess we’ll find out. There’s also a VoyantServer option, which you install on your own computer, for faster processing and greater privacy.

Never heard of Voyant? Then you’d best get yourself some early modern sources in full text format and head on over to

Testing, testing

…my export test from Aeon 2 timeline software to the web. Preparing to teach a new Introduction to Digital History course in the fall, while overseeing the creation of a modest Digital History Lab, will make you dust off all sorts of old, half-baked projects.

So we just reacquainted ourselves with my old website, started in 1998-1999, a period which coincided with me procrastinating after returning from my dissertation research in the French, English and Dutch archives. I had allowed the website to go fallow (but running) since 2006 or so – funny how a full-time job will do that. So today we reconnected, remembered the right password, downloaded local copies to sync on a new computer (using Dreamweaver), and now we’re up and running again.

Hopefully I’ll be able to put up a bunch of timelines for my various courses on the site, so I can give the URLs to students, as well as pull up the timelines in the classroom, rather than lug in my laptop and hook it up to the projector. Manually creating timelines in Illustrator has been fun (for example), but it takes a long time to make each one, and the data isn’t exportable, searchable, or manipulable like CSV files are. Which might be useful, say, if you were getting back into databases. Once I get GIS under my belt, I might possibly put up some maps as well – to replace those old AutoCAD map files from 1997. Oh yeah, I should probably replace that circa 1999 homepage too:

Screenshot 2017-06-24 23.26.10.png

It seemed cool 18 years ago (to me, at least), but I’m told styles have changed since then.

And I’d totally forgotten about my attempt to create a website for EMEMHers circa 2002. Turns out I even posted a few items, like the data from Erik Lund’s Austrian generals, and a PDF of John Lynn & George Satterfield’s Guide to Early Modern Military Sources in Midwestern Research Libraries (back when the proximity of rare book rooms was critical). Most amusing is my page where I include a list of books that it’d be great to have digital copies of. Good times. Of course, it’s also kinda depressing to realize that I’m now a part of history.

Jumping back to the present, my first experiment merging the early 21st century with the late 20th century seems successful – a timeline of various events and individuals from the Crusades. A course which, FWIW, I’ll be teaching again this fall. So if you’re interested in seeing how the Aeon timeline software translates to the Internet, take a peek at The timeline is dynamic: scrolling, zooming, searching, collapsing ‘arcs’, and clicking on arrows for further details. I haven’t updated the data to take advantage of Aeon version 2 yet, nor have I connected all the people and events or included many notes. But feel free to send any corrections my way.

Next up: figuring out the Simile widget, which will allow a bit more customization. An interesting example of combining an interactive timeline and map is here. Throw in embedded widgets for family trees, maps (Google or otherwise), argument maps, and Voyant text analysis – now you’ve got yourself a historical toolkit worthy of the 21st century!

Looking back over my own experience with twenty years of Internet history, I’m reminded of that old Virginia Slims cigarette ad: “We’ve come a long way, baby.”


The Google Giveth and the Google Taketh Away

In other words, hopefully you’ve already downloaded all those tasty EMEMH works from Google Books, like I’ve warned. Because some of them are disappearing from Full View, as publishing companies (I’m guessing) pay Google some money to sell print copies on Amazon and elsewhere. (See, I knew my hoarding instincts and general obsessive-compulsiveness would come in handy.)

But all hope is not lost, for if you can still find interesting EMEMH PDFs, Google Books has recently decided to include the OCRed text layer with the PDF download as well, which means they are searchable. Just don’t look too closely at the results…

As I was saying

somebody should let people know when there are museum exhibits on early modern military subjects.

I’ve been writing up personal summaries of our recent trip to Vienna-Salzburg-Munich (and sprinkling them with photos off the web, which are usually far better than what we can manage), lest the memories fade from view too quickly. Pursuant to this task, I started looking up a bunch of early modern artists’ works in Google image search. Concurrently, my RSS feed alerted me to Amy Herman’s Visual Intelligence, which I acquired and have been reading with interest. In a suitably artistic state of mind, I thought I’d look up the Frick Collection (where Herman worked), just to see what kind of museum it was. Turns out, it’s in New York City (a few hours from me), and has some early modern works. So on a further whim, keenly aware of the fortuitous timing that allowed us to see the ephemeral Feste feiern and Kaiser Karl V erobert Tunis exhibits in Vienna, I checked to see what special exhibits the Frick had coming up. And, lo and behold, I find this exhibit, starting July 12 and running through October 2: Watteau’s Soldiers: Scenes of Military Life in Eighteenth-Century France. The description of the exhibit:

It would be difficult to think of an artist further removed from the muck and misery of war than Jean-Antoine Watteau (1684–1721), who is known as a painter of amorous aristocrats and melancholy actors. And yet, early in his career, Watteau painted a number of scenes of military life. They were produced during one of the darkest chapters of France’s history, the War of the Spanish Succession (1701–14), but the martial glory on which most military painters trained their gaze held no interest for Watteau. Instead, he focused on the most prosaic aspects of war — marches, halts, and encampments. The resulting works show the quiet moments between the fighting, when soldiers could rest and daydream, smoke pipes and play cards.


Watteau’s soldiers chillin’ at a Valenciennes gate

Presented exclusively at The Frick Collection in the summer of 2016, Watteau’s Soldiers is the first exhibition devoted solely to these captivating pictures, introducing the artist’s engagement with military life to a larger audience while exploring his unusual working methods. Among the paintings, drawings, and prints will be four of the seven known military scenes — with the Frick’s own Portal of Valenciennes as the centerpiece — as well as the recently rediscovered Supply Train, which has never before been exhibited publicly in a museum. Also featured will be thirteen studies of soldiers in red chalk, many directly related to the paintings on view, as well as a selection of works by Watteau’s predecessors and followers, the Frick’s Cavalry Camp by Philips Wouwerman among them.

An accompanying book by Anne L. Poulet Curatorial Fellow Aaron Wile, published by The Frick Collection in association with D Giles, Ltd., London, will be the first illustrated catalogue of all Watteau works related to military subjects.

So if you’ll be in the region this summer, make some time to check it out. I know I will. And if you can’t, at least consider checking out the catalog. Hopefully it’ll explain why Watteau’s short career should be divided into “early” and “late” works.

More posts on the military art to come.



Short post on Vienna trip

Just got back from a two-week excursion to central Europe, with a quick turnaround for other familial obligations.

But lest you think I was merely reading Georg Scherer’s sermons at a Viennese café while drinking my Wiener Melange (more like eating Apfelstrudel mit Schlagobers and reading reports of yet another act of hate/terrorism/gun violence in the U.S.), I was actually hard at work, traipsing across the historical flotsam and jetsam of what once was the crown jewel of the Austro-Hungarian empire. But that’s for another time.

To tide you over, in case you’re in Vienna over the next couple of months, and are interested in all things Karl V, the Kunsthistorisches Museum has a top-floor exhibit on Charles V’s capture of Tunis in 1535. There are apparently some tapestries of his successful North African campaign in the Prado, but Vienna has the “cartoons” (the paintings which were the basis of the tapestries) currently on display.

Detail from Vermeyen cartoon of Tunis 1535 campaign


For a brief (English-language) overview of the exhibit, you can look here.

The KHM also has a (German-language) catalog of the exhibit. Which makes me think there really should be some art museum listserv to alert interested parties to military history-themed exhibits. Though something like this might be a start.

What should historical research look like in an age of digital collaboration?

Historical research, as most of us know, has traditionally been a solitary practice. Even in this postmodern age of killa’ collabs and remixes with co-authors named feat., historians, by and large, are still a lonely bunch of recluses. Admittedly, one’s choice of subject has a lot to do with how crowded your subfield is. Unfortunately (or not?), I’ve rarely been in a position where I knew somebody else who was actively researching the same war as me (War of the Spanish Succession) and might want to look at the same sources. John Stapleton is the closest example from my grad school days, and he focuses on the war before “mine,” so we’ve given each other feedback and pointed each other to various sources from “our” respective wars over the years. In general, though, it’s been kinda lonely out here on the plains.

But the times they are a-changin’ and the prairie turf is being transformed into suburban subdivisions. The question is whether all these houses will follow a similar aesthetic, whether their architecture will reference each other, or whether the only communication between neighbors will consist of vague nods at the grocery store and heated arguments over how far their property line extends. (Thus far, subdivisions are still segregated into ethnic neighborhoods.)

If we look beyond the discipline of History, we’re told that it’s an age of collaboration (CEOs say they want their new employees to work effectively in teams) as well as the age of information overload (I believe that – my main Devonthink database has grown to 104,000 documents and 95 million words of text). Even the other kind of doctors are having a rethink. Now this whole Internet thing allows like-minded individuals to communicate and commiserate across the planet, and not just with their neighbor next door. “Global village” and all that. As a result, even historians have figured out that we can now find out if we’re alone in the universe or not – I assume everybody has Google Alerts set for their name and publication titles? This academic version of Google Street View has certainly expanded my worldview. My one semi-regret is that, thanks to online dissertations, conference proceedings and even blogs, I now find out I was in the archives 10-15 years too early, and there are currently a bunch of people both American and Euro looking into the period – and by “bunch” I mean maybe 6-12. Even more reasons for making connections. Hmmm, someone should create a blog that allows EMEMH scholars to communicate with each other…

So how should historical research work in this interconnected digital age, in this global, digital village? In an age when the moderately-well-heeled scholar can accumulate scans of thousands of rare books and hundreds of archival volumes? The combination of collaboration and digitization has opened up a spectrum of possibilities, and it’s up to us to decide which are worth exploring. Here are some possibilities I see, stretching along a spectrum from sharing general ideas to swapping concrete primary sources (Roy Rosenzweig undoubtedly predicted all this twenty years ago):

  • Topic Sharing. The way it’s traditionally been done, in grad school, or if people meet up in the archives or at a conference or on fellowship. You let people know the specific topics you’re working on, and let it progress from there: “Oh, you’re working on X. Do you know about …? Have you checked out Y? You should really look at Z.” This has two advantages: first, it allows participants to keep the details of their research close to the vest, and more fruitfully, it allows the historiography to develop into a conversation rather than separate ships passing each other in the night – it’s such a waste when something gets published that really should have looked at X, Y or Z, but nobody suggested it. Or, perhaps peers studying the same period/place offered comment, but other potential-peers studying the same theme didn’t (or vice versa). Sharing subjects also forces people to acknowledge that they might not be the only person writing on topic X, and encourages them to consider whether they might want to divvy up topics rather than writing in ignorance of what others will be publishing, or already have written. Say, hypothetically, when one thinks they want to write a chapter about how the French viewed battle in the War of the Spanish Succession, and then discovers that another scholar has already written about a thousand pages on the subject. So letting others know what you’re working on would be a start: type of history, subject (sieges? battles? operations? logistics?…), type of study (campaign narrative? commander biography? comparison of two different theaters?…), sides/countries (including languages of sources being used), and so on.
  • Feedback and advice. This requires longer and more sustained interaction, but is far more useful for all involved. I’m not convinced by the latest bestseller claiming that the crowd is always right, but crowdsourcing certainly gives a scholar a sense of how his/her ideas are being received, and what ideas a potential audience might like to read about in the first place.
  • Research assistance. Here, I would suggest, is where most historians are still living in the stone age, or more accurately, are on the cusp between the paper and digital ages. Most of our precious historical documents survive entombed within a single piece of paper(s), in an archive that may require significant costs and time to access. Depending on a government’s view of cultural patrimony and the opportunity for a marketable product, a subset of those documents have been transferred to the digital realm. But not many. This is where many historians need help, a topic which we’ve discussed many times before (as with this thread, which prompted the present post), and where collaboration and digitization offer potential solutions to the inaccessibility of so many primary sources.
    But there is a rather important catch: copyright. Archives and libraries (and publishers, of course) claim copyright over the documents under their care, and they frown upon the idea that information just wants to be free (ask Aaron Swartz):
    CAC copyright slip
    So this puts a bit of a kink in attempts to create a Napster-style primary source swap meet – though I am getting a little excited just imagining a primary-source orgy like Napster was back in the day.
    Fortunately there are steps short of scofflawery. Most of these revolve around the idea of improving the ‘finding aids’ historians use to target particular documents within the millions of possibilities. These range in scale from helping others plan a strategic bombing campaign, to serving as forward observer for a surgical strike:

    • A wish list of specific volumes/documents that somebody would like to look at. This could be as simple as having somebody who has the document(s) just check to see what it discusses, whether it’s worth consulting. This, of course, requires a bit more time and effort than simply sharing the PDF.
    • Or it might mean providing some metadata on the documents in a given volume. For example, I discovered in the archives that if the Blenheim Papers catalog says that Salisch’s letters to Marlborough in volume XYZ cover the period 1702-1711, and I’m studying the siege of Douai in 1710, it is a waste of one of my limited daily requests to discover that Salisch’s letters include one dated 1702, one from 1711, and the rest all from 1708. The ability to pinpoint specific documents would in itself be a boon: many archives have indexes and catalogs and inventories that give almost no idea of the individual documents. Not only would it save time, but it might also save money if you want to order copies of just a few documents rather than an entire volume.
    • Or, such assistance could be as involved as transcribing the meaty bits of a document. Useful for full text, though purists might harbor a lingering doubt about the fidelity of the transcription.
    • Or, it might mean running queries for others based off of your own database. I did that for a fellow scholar once, and if you’ve got something like Devonthink (or at least lots of full-text sources), it’s pretty easy and painless. Though if there are too many results, that starts to look a bit like doing someone else’s research for them.

Of course with all of these options, you have to worry about thunder being stolen, about trusting someone else to find what you are looking for, etc., etc. And there probably isn’t a good way to assuage that concern except through trust that develops over time. And trust is based on a sense of fairness: Andy’s questions about how to create a system of calculating non-monetary exchanges have bedeviled barter systems for a long time, I think.

As usual, I don’t have a clear answer. Simple sharing of documents is undoubtedly the easiest solution (cheapest, quickest, fewest number of eyes between the original source and your interpretation), but I don’t have a system for the mechanics. Nor am I clear on the ethical issues of massive sharing of sources – is “My thanks to X for this source” in a footnote enough? If some documents are acquired with grant funds, can they be freely given away? And the list goes on…


Bird’s-eye view of the French Wars of Religion

Recently finished up three days on the French Wars of Religion in my Religion, War and Peace course, which means I can now post this old graphic summary of the wars. It almost makes sense of those crazy conflicts. Almost.


Can’t we all just get along?

This is probably my favorite time chart, aesthetically at least, but feel free to provide corrections or comments. Tons of gory detail, but I think you can also see the big picture as well.

Here’s an abbreviated version I put in the margin of my PowerPoint slides:

For the masses
